Rupal Patel: Synthetic voices, as unique as fingerprints

I'd like to talk today about a powerful and fundamental aspect of who we are: our voice. Each one of us has a unique voiceprint that reflects our age, our size, even our lifestyle and personality. In the words of the poet Longfellow, "the human voice is the organ of the soul." As a speech scientist, I'm fascinated by how the voice is produced, and I have an idea for how it can be engineered. That's what I'd like to share with you.

I'm going to start by playing you a sample of a voice that you may recognize.

(Recording) Stephen Hawking: "I would have thought it was fairly obvious what I meant."

Rupal Patel: That was the voice of Professor Stephen Hawking. What you may not know is that same voice may also be used by this little girl who is unable to speak because of a neurological condition. In fact, all of these individuals may be using the same voice, and that's because there are only a few options available. In the U.S. alone, there are 2.5 million Americans who are unable to speak, many of whom use computerized devices to communicate. Now that's millions of people worldwide who are using generic voices, including Professor Hawking, who uses an American-accented voice. This lack of individuation of the synthetic voice really hit home when I was at an assistive technology conference a few years ago, and I recall walking into an exhibit hall and seeing a little girl and a grown man having a conversation using their devices, different devices, but the same voice. And I looked around and I saw this happening all around me, literally hundreds of individuals using a handful of voices, voices that didn't fit their bodies or their personalities. We wouldn't dream of fitting a little girl with the prosthetic limb of a grown man. So why then the same prosthetic voice? It really struck me, and I wanted to do something about this.

I'm going to play you now a sample of two people who have severe speech disorders. I want you to take a listen to how they sound. They're saying the same utterance.

(First voice)

(Second voice)

RP: You probably didn't understand what they said, but I hope that you heard their unique vocal identities.

So what I wanted to do next is, I wanted to find out how we could harness these residual vocal abilities and build a technology that could be customized for them, voices that could be customized for them. So I reached out to my collaborator, Tim Bunnell. Dr. Bunnell is an expert in speech synthesis, and what he'd been doing is building personalized voices for people by putting together pre-recorded samples of their voice and reconstructing a voice for them. These are people who had lost their voice later in life. We didn't have the luxury of pre-recorded samples of speech for those born with a speech disorder. But I thought, there had to be a way to reverse engineer a voice from whatever little is left over.

So we decided to do exactly that. We set out with a little bit of funding from the National Science Foundation, to create custom-crafted voices that captured their unique vocal identities. We call this project VocaliD, or vocal I.D., for vocal identity.

Now before I get into the details of how the voice is made and let you listen to it, I need to give you a real quick speech science lesson. Okay? So first, we know that the voice is changing dramatically over the course of development. Children sound different from teens who sound different from adults. We've all experienced this. Fact number two is that speech is a combination of the source, which is the vibrations generated by your voice box, which are then pushed through the rest of the vocal tract. These are the chambers of your head and neck that vibrate, and they actually filter that source sound to produce consonants and vowels. So the combination of source and filter is how we produce speech. And that happens in one individual.
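
The source-filter idea described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the project's actual code: the function names, the impulse-train source, the two-pole resonators, and the formant values (rough, illustrative numbers for an "ah"-like vowel) are all hypothetical choices made for this example.

```python
import math

def glottal_source(f0_hz, duration_s, sample_rate=16000):
    """Impulse train: one pulse per pitch period, a crude stand-in for voice-box vibration."""
    n = int(duration_s * sample_rate)
    period = round(sample_rate / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate=16000):
    """Two-pole resonator approximating one vocal-tract resonance (a formant)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, always < 1
    theta = 2 * math.pi * center_hz / sample_rate        # pole angle
    a1, a2 = 2 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def synthesize_vowel(f0_hz, formants, duration_s=0.5, sample_rate=16000):
    """Push the source through a cascade of resonators (the filter)."""
    signal = glottal_source(f0_hz, duration_s, sample_rate)
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz, sample_rate)
    return signal

# Assumed (center, bandwidth) pairs in Hz for an "ah"-like vowel.
AH_FORMANTS = [(700, 130), (1220, 70), (2600, 160)]

# Same filter, two different sources: a higher and a lower pitch.
child_like = synthesize_vowel(260, AH_FORMANTS)
adult_like = synthesize_vowel(110, AH_FORMANTS)
```

Keeping the source and the filter as separate functions mirrors the point of the lesson: either one can be swapped out while the other stays fixed.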

Now I told you earlier that I'd spent a good part of my career understanding and studying the source characteristics of people with severe speech disorder, and what I've found is that even though their filters were impaired, they were able to modulate their source: the pitch, the loudness, the tempo of their voice. These are called prosody, and I've been documenting for years that the prosodic abilities of these individuals are preserved. So when I realized that those same cues are also important for speaker identity, I had this idea. Why don't we take the source from the person we want the voice to sound like, because it's preserved, and borrow the filter from someone about the same age and size, because they can articulate speech, and then mix them? Because when we mix them, we can get a voice that's as clear as our surrogate talker -- that's the person we borrowed the filter from -- and is similar in identity to our target talker. It's that simple. That's the science behind what we're doing.
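
One way to picture the mixing step is as a set of adjustments that pull the surrogate's recordings toward the target's pitch, loudness, and tempo. The sketch below is an assumption made for illustration only: the talk does not describe the actual algorithm, and `SourceProfile`, `mixing_controls`, and the numbers used are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceProfile:
    """The three prosodic cues named in the talk (units are assumed)."""
    pitch_hz: float
    loudness_db: float
    tempo_syllables_per_s: float

def mixing_controls(target, surrogate):
    """Adjustments to apply to the surrogate's clear recordings so that they
    carry the target talker's source characteristics."""
    return {
        "pitch_ratio": target.pitch_hz / surrogate.pitch_hz,
        "gain_db": target.loudness_db - surrogate.loudness_db,
        "tempo_ratio": target.tempo_syllables_per_s / surrogate.tempo_syllables_per_s,
    }

# Made-up measurements for a target talker and a surrogate talker.
target = SourceProfile(pitch_hz=240.0, loudness_db=62.0, tempo_syllables_per_s=3.0)
surrogate = SourceProfile(pitch_hz=200.0, loudness_db=65.0, tempo_syllables_per_s=4.0)
controls = mixing_controls(target, surrogate)
```

The surrogate still supplies the articulation, that is, the filter; only the source-related cues are moved toward the target.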

So once you have that in mind, how do you go about building this voice? Well, you have to find someone who is willing to be a surrogate. It's not such an ominous thing. Being a surrogate donor only requires you to say a few hundred to a few thousand utterances. The process goes something like this.

(Video) Voice: Things happen in pairs.

I love to sleep.

The sky is blue without clouds.

RP: Now she's going to go on like this for about three to four hours, and the idea is not for her to say everything that the target is going to want to say, but the idea is to cover all the different combinations of the sounds that occur in the language. The more speech you have, the better sounding voice you're going to have. Once you have those recordings, what we need to do is we have to parse these recordings into little snippets of speech, one- or two-sound combinations, sometimes even whole words that start populating a dataset or a database. We're going to call this database a voice bank. Now the power of the voice bank is that from this voice bank, we can now say any new utterance, like, "I love chocolate" -- everyone needs to be able to say that -- fish through that database and find all the segments necessary to say that utterance.
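
The voice bank lookup described above can be sketched as a dictionary of recorded snippets plus a search that prefers the longest stored snippet at each position. This is a toy illustration under stated assumptions, not the project's implementation: the phoneme labels, the hyphen-joined unit names, the greedy longest-match strategy, and the integer "samples" are all hypothetical.

```python
def select_units(phonemes, bank, max_span=3):
    """Greedy longest-match: at each position, take the longest stored snippet."""
    units, i = [], 0
    while i < len(phonemes):
        for span in range(min(max_span, len(phonemes) - i), 0, -1):
            label = "-".join(phonemes[i:i + span])
            if label in bank:
                units.append(label)
                i += span
                break
        else:
            # No snippet of any length starts here: the bank lacks coverage.
            raise KeyError(f"no unit in the voice bank covers {phonemes[i]!r}")
    return units

def synthesize(phonemes, bank):
    """Concatenate the samples of the selected snippets in order."""
    samples = []
    for label in select_units(phonemes, bank):
        samples.extend(bank[label])
    return samples

# Toy voice bank: unit label -> samples (small integers stand in for audio).
voice_bank = {
    "ay": [1],
    "l-ah": [2, 3],
    "v": [4],
    "ch-aa-k": [5, 6, 7],
    "l-ah-t": [8, 9, 10],
    "t": [11],
}

# Assumed phoneme sequence for "I love chocolate".
I_LOVE_CHOCOLATE = ["ay", "l", "ah", "v", "ch", "aa", "k", "l", "ah", "t"]
units = select_units(I_LOVE_CHOCOLATE, voice_bank)
audio = synthesize(I_LOVE_CHOCOLATE, voice_bank)
```

The error raised for an uncovered sound shows why the donor reads for hours: the recordings have to cover the sound combinations of the language, not the specific sentences the target will say.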

(Video) Voice: I love chocolate.

RP: So that's speech synthesis. It's called concatenative synthesis, and that's what we're using. That's not the novel part. What's novel is how we make it sound like this young woman.

This is Samantha. I met her when she was nine, and since then, my team and I have been trying to build her a personalized voice. We first had to find a surrogate donor, and then we had to have Samantha produce some utterances. What she can produce are mostly vowel-like sounds, but that's enough for us to extract her source characteristics. What happens next is best described by my daughter's analogy. She's six. She calls it mixing colors to paint voices. It's beautiful. It's exactly that. Samantha's voice is like a concentrated sample of red food dye which we can infuse into the recordings of her surrogate to get a pink voice just like this.

(Video) Samantha: Aaaaaah.

RP: So now, Samantha can say this.

(Video) Samantha: This voice is only for me. I can't wait to use my new voice with my friends.

RP: Thank you. (Applause)

I'll never forget the gentle smile that spread across her face when she heard that voice for the first time. Now there's millions of people around the world like Samantha, millions, and we've only begun to scratch the surface. What we've done so far is we have a few surrogate talkers from around the U.S. who have donated their voices, and we have been using those to build our first few personalized voices. But there's so much more work to be done. For Samantha, her surrogate came from somewhere in the Midwest, a stranger who gave her the gift of voice. And as a scientist, I'm so excited to take this work out of the laboratory and finally into the real world so it can have real-world impact. What I want to share with you next is how I envision taking this work to that next level. I imagine a whole world of surrogate donors from all walks of life, different sizes, different ages, coming together in this voice drive to give people voices that are as colorful as their personalities. To do that as a first step, we've put together this website, VocaliD.org, as a way to bring together those who want to join us as voice donors, as expertise donors, in whatever way to make this vision a reality.

They say that giving blood can save lives. Well, giving your voice can change lives. All we need is a few hours of speech from our surrogate talker, and as little as a vowel from our target talker, to create a unique vocal identity.

So that's the science behind what we're doing. I want to end by circling back to the human side that is really the inspiration for this work. About five years ago, we built our very first voice for a little boy named William. When his mom first heard this voice, she said, "This is what William would have sounded like had he been able to speak." And then I saw William typing a message on his device. I wondered, what was he thinking? Imagine carrying around someone else's voice for nine years and finally finding your own voice. Imagine that.

This is what William said: "Never heard me before."

Thank you.

(Applause)


