Rupal Patel: Synthetic voices, as unique as fingerprints

I'd like to talk today about a powerful and fundamental aspect of who we are: our voice. Each one of us has a unique voiceprint that reflects our age, our size, even our lifestyle and personality. In the words of the poet Longfellow, "the human voice is the organ of the soul." As a speech scientist, I'm fascinated by how the voice is produced, and I have an idea for how it can be engineered. That's what I'd like to share with you.

I'm going to start by playing you a sample of a voice that you may recognize.

(Recording) Stephen Hawking: "I would have thought it was fairly obvious what I meant."

Rupal Patel: That was the voice of Professor Stephen Hawking. What you may not know is that same voice may also be used by this little girl who is unable to speak because of a neurological condition. In fact, all of these individuals may be using the same voice, and that's because there are only a few options available. In the U.S. alone, there are 2.5 million Americans who are unable to speak, many of whom use computerized devices to communicate. Now that's millions of people worldwide who are using generic voices, including Professor Hawking, who uses an American-accented voice. This lack of individuation of the synthetic voice really hit home when I was at an assistive technology conference a few years ago, and I recall walking into an exhibit hall and seeing a little girl and a grown man having a conversation using their devices, different devices, but the same voice. And I looked around and I saw this happening all around me, literally hundreds of individuals using a handful of voices, voices that didn't fit their bodies or their personalities. We wouldn't dream of fitting a little girl with the prosthetic limb of a grown man. So why then the same prosthetic voice? It really struck me, and I wanted to do something about this.

I'm going to play you now a sample of someone who has, two people actually, who have severe speech disorders. I want you to take a listen to how they sound. They're saying the same utterance.

(First voice)

(Second voice)

You probably didn't understand what they said, but I hope that you heard their unique vocal identities.

So what I wanted to do next is, I wanted to find out how we could harness these residual vocal abilities and build a technology that could be customized for them, voices that could be customized for them. So I reached out to my collaborator, Tim Bunnell. Dr. Bunnell is an expert in speech synthesis, and what he'd been doing is building personalized voices for people by putting together pre-recorded samples of their voice and reconstructing a voice for them. These are people who had lost their voice later in life. We didn't have the luxury of pre-recorded samples of speech for those born with a speech disorder. But I thought, there had to be a way to reverse engineer a voice from whatever little is left over.

So we decided to do exactly that. We set out with a little bit of funding from the National Science Foundation, to create custom-crafted voices that captured their unique vocal identities. We call this project VocaliD, or vocal I.D., for vocal identity.

Now before I get into the details of how the voice is made and let you listen to it, I need to give you a real quick speech science lesson. Okay? So first, we know that the voice is changing dramatically over the course of development. Children sound different from teens who sound different from adults. We've all experienced this. Fact number two is that speech is a combination of the source, which is the vibrations generated by your voice box, which are then pushed through the rest of the vocal tract. These are the chambers of your head and neck that vibrate, and they actually filter that source sound to produce consonants and vowels. So the combination of source and filter is how we produce speech. And that happens in one individual.
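(Editor's sketch) The source-filter idea described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the method used in the project: the source is an idealized impulse train, each vocal-tract chamber is approximated by a two-pole resonator, and the function names, sample rate, and formant values are rough illustrative choices that do not come from the talk.

```python
import math

def glottal_source(f0_hz, duration_s, sample_rate=8000):
    """Impulse train standing in for voice-box vibrations at pitch f0_hz."""
    n = round(duration_s * sample_rate)
    period = round(sample_rate / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, formant_hz, bandwidth_hz, sample_rate=8000):
    """Two-pole resonant filter standing in for one vocal-tract chamber."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * formant_hz / sample_rate)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2  # y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
        out.append(y)
        y2, y1 = y1, y
    return out

def vowel(f0_hz, formants, duration_s=0.05, sample_rate=8000):
    """Pass one source through a cascade of resonators (the filter)."""
    signal = glottal_source(f0_hz, duration_s, sample_rate)
    for formant_hz, bandwidth_hz in formants:
        signal = resonator(signal, formant_hz, bandwidth_hz, sample_rate)
    return signal

# Same source (200 Hz pitch), two different filters (assumed formant values).
source = glottal_source(200, 0.05)
ah = vowel(200, [(700, 100), (1200, 120)])
ee = vowel(200, [(300, 100), (2300, 120)])
```

Here `ah` and `ee` share the same pitch because they share a source, yet the waveforms differ because the filters differ. That is the separation the talk relies on.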

Now I told you earlier that I'd spent a good part of my career understanding and studying the source characteristics of people with severe speech disorders, and what I've found is that even though their filters were impaired, they were able to modulate their source: the pitch, the loudness, the tempo of their voice. These are called prosody, and I've been documenting for years that the prosodic abilities of these individuals are preserved. So when I realized that those same cues are also important for speaker identity, I had this idea. Why don't we take the source from the person we want the voice to sound like, because it's preserved, and borrow the filter from someone about the same age and size, because they can articulate speech, and then mix them? Because when we mix them, we can get a voice that's as clear as our surrogate talker -- that's the person we borrowed the filter from -- and is similar in identity to our target talker. It's that simple. That's the science behind what we're doing.
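(Editor's sketch) The mixing step described above can be pictured as a small data structure: keep the target talker's source, borrow the surrogate talker's filter. This is only a conceptual sketch. The class names, the `filter_label` placeholder, and all numeric values are invented for illustration and do not describe how the project represents voices.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    """Prosodic traits named in the talk: pitch, loudness, tempo."""
    pitch_hz: float
    loudness_db: float
    tempo_syllables_per_s: float

@dataclass(frozen=True)
class Talker:
    name: str
    source: Source
    filter_label: str  # hypothetical placeholder for articulation data

@dataclass(frozen=True)
class BlendedVoice:
    source: Source
    filter_label: str

def blend(target: Talker, surrogate: Talker) -> BlendedVoice:
    """Keep the target's source, borrow the surrogate's filter."""
    return BlendedVoice(source=target.source,
                        filter_label=surrogate.filter_label)

# Invented example values, not measurements from the talk.
target = Talker("target", Source(260.0, 62.0, 3.1), "impaired")
surrogate = Talker("surrogate", Source(230.0, 65.0, 4.0), "clear")
voice = blend(target, surrogate)
```

The blended voice carries the target's pitch, loudness, and tempo (identity) together with the surrogate's articulation (clarity).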

So once you have that in mind, how do you go about building this voice? Well, you have to find someone who is willing to be a surrogate. It's not such an ominous thing. Being a surrogate donor only requires you to say a few hundred to a few thousand utterances. The process goes something like this.

(Video) Voice: Things happen in pairs.

I love to sleep.

The sky is blue without clouds.

RP: Now she's going to go on like this for about three to four hours, and the idea is not for her to say everything that the target is going to want to say, but the idea is to cover all the different combinations of the sounds that occur in the language. The more speech you have, the better sounding voice you're going to have. Once you have those recordings, what we need to do is we have to parse these recordings into little snippets of speech, one- or two-sound combinations, sometimes even whole words that start populating a dataset or a database. We're going to call this database a voice bank. Now the power of the voice bank is that from this voice bank, we can now say any new utterance, like, "I love chocolate" -- everyone needs to be able to say that -- fish through that database and find all the segments necessary to say that utterance.
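(Editor's sketch) The voice bank lookup described above can be sketched as a dictionary from snippets to their locations in the recordings. As a simplifying assumption, this sketch indexes whole words only, whereas the talk describes mostly one- or two-sound combinations. The function names are hypothetical, and the fourth recording is an invented sentence added so that "chocolate" is covered.

```python
def build_voice_bank(recorded_utterances):
    """Index every word of every recording as a reusable snippet."""
    bank = {}
    for clip_id, utterance in enumerate(recorded_utterances):
        for position, word in enumerate(utterance.lower().split()):
            # Keep the first recorded instance of each word.
            bank.setdefault(word, (clip_id, position))
    return bank

def synthesize(bank, new_utterance):
    """Find one snippet per word; fail loudly if coverage is missing."""
    plan = []
    for word in new_utterance.lower().split():
        if word not in bank:
            raise KeyError(f"no recorded snippet covers {word!r}")
        plan.append((word, bank[word]))
    return plan

recordings = [
    "Things happen in pairs",
    "I love to sleep",
    "The sky is blue without clouds",
    "Chocolate melts quickly",  # invented recording, not from the talk
]
bank = build_voice_bank(recordings)
plan = synthesize(bank, "I love chocolate")
```

Each entry in `plan` names the recording and position to splice from, which is why broader coverage in the recordings yields a more capable voice: a word or sound that was never recorded cannot be produced.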

(Video) Voice: I love chocolate.

RP: So that's speech synthesis. It's called concatenative synthesis, and that's what we're using. That's not the novel part. What's novel is how we make it sound like this young woman.

This is Samantha. I met her when she was nine, and since then, my team and I have been trying to build her a personalized voice. We first had to find a surrogate donor, and then we had to have Samantha produce some utterances. What she can produce are mostly vowel-like sounds, but that's enough for us to extract her source characteristics. What happens next is best described by my daughter's analogy. She's six. She calls it mixing colors to paint voices. It's beautiful. It's exactly that. Samantha's voice is like a concentrated sample of red food dye which we can infuse into the recordings of her surrogate to get a pink voice just like this.

(Video) Samantha: Aaaaaah.

RP: So now, Samantha can say this.

(Video) Samantha: This voice is only for me. I can't wait to use my new voice with my friends.

RP: Thank you. (Applause)

I'll never forget the gentle smile that spread across her face when she heard that voice for the first time. Now there's millions of people around the world like Samantha, millions, and we've only begun to scratch the surface. What we've done so far is we have a few surrogate talkers from around the U.S. who have donated their voices, and we have been using those to build our first few personalized voices. But there's so much more work to be done. For Samantha, her surrogate came from somewhere in the Midwest, a stranger who gave her the gift of voice. And as a scientist, I'm so excited to take this work out of the laboratory and finally into the real world so it can have real-world impact. What I want to share with you next is how I envision taking this work to that next level. I imagine a whole world of surrogate donors from all walks of life, different sizes, different ages, coming together in this voice drive to give people voices that are as colorful as their personalities. To do that as a first step, we've put together this website, VocaliD.org, as a way to bring together those who want to join us as voice donors, as expertise donors, in whatever way to make this vision a reality.

They say that giving blood can save lives. Well, giving your voice can change lives. All we need is a few hours of speech from our surrogate talker, and as little as a vowel from our target talker, to create a unique vocal identity.

So that's the science behind what we're doing. I want to end by circling back to the human side that is really the inspiration for this work. About five years ago, we built our very first voice for a little boy named William. When his mom first heard this voice, she said, "This is what William would have sounded like had he been able to speak." And then I saw William typing a message on his device. I wondered, what was he thinking? Imagine carrying around someone else's voice for nine years and finally finding your own voice. Imagine that.

This is what William said: "Never heard me before."

Thank you.

(Applause)
