Rupal Patel: Synthetic voices, as unique as fingerprints

I'd like to talk today about a powerful and fundamental aspect of who we are: our voice. Each one of us has a unique voiceprint that reflects our age, our size, even our lifestyle and personality. In the words of the poet Longfellow, "the human voice is the organ of the soul." As a speech scientist, I'm fascinated by how the voice is produced, and I have an idea for how it can be engineered. That's what I'd like to share with you.

I'm going to start by playing you a sample of a voice that you may recognize.

(Recording) Stephen Hawking: "I would have thought it was fairly obvious what I meant."

Rupal Patel: That was the voice of Professor Stephen Hawking. What you may not know is that same voice may also be used by this little girl who is unable to speak because of a neurological condition. In fact, all of these individuals may be using the same voice, and that's because there's only a few options available. In the U.S. alone, there are 2.5 million Americans who are unable to speak, and many of whom use computerized devices to communicate. Now that's millions of people worldwide who are using generic voices, including Professor Hawking, who uses an American-accented voice. This lack of individuation of the synthetic voice really hit home when I was at an assistive technology conference a few years ago, and I recall walking into an exhibit hall and seeing a little girl and a grown man having a conversation using their devices, different devices, but the same voice. And I looked around and I saw this happening all around me, literally hundreds of individuals using a handful of voices, voices that didn't fit their bodies or their personalities. We wouldn't dream of fitting a little girl with the prosthetic limb of a grown man. So why then the same prosthetic voice? It really struck me, and I wanted to do something about this.

I'm going to play you now a sample of someone who has, two people actually, who have severe speech disorders. I want you to take a listen to how they sound. They're saying the same utterance.

(First voice)

(Second voice) You probably didn't understand what they said, but I hope that you heard their unique vocal identities.

So what I wanted to do next is, I wanted to find out how we could harness these residual vocal abilities and build a technology that could be customized for them, voices that could be customized for them. So I reached out to my collaborator, Tim Bunnell. Dr. Bunnell is an expert in speech synthesis, and what he'd been doing is building personalized voices for people by putting together pre-recorded samples of their voice and reconstructing a voice for them. These are people who had lost their voice later in life. We didn't have the luxury of pre-recorded samples of speech for those born with speech disorder. But I thought, there had to be a way to reverse engineer a voice from whatever little is left over.

So we decided to do exactly that. We set out with a little bit of funding from the National Science Foundation, to create custom-crafted voices that captured their unique vocal identities. We call this project VocaliD, or vocal I.D., for vocal identity.

Now before I get into the details of how the voice is made and let you listen to it, I need to give you a real quick speech science lesson. Okay? So first, we know that the voice is changing dramatically over the course of development. Children sound different from teens who sound different from adults. We've all experienced this. Fact number two is that speech is a combination of the source, which is the vibrations generated by your voice box, which are then pushed through the rest of the vocal tract. These are the chambers of your head and neck that vibrate, and they actually filter that source sound to produce consonants and vowels. So the combination of source and filter is how we produce speech. And that happens in one individual.

Now I told you earlier that I'd spent a good part of my career understanding and studying the source characteristics of people with severe speech disorder, and what I've found is that even though their filters were impaired, they were able to modulate their source: the pitch, the loudness, the tempo of their voice. These are called prosody, and I've been documenting for years that the prosodic abilities of these individuals are preserved. So when I realized that those same cues are also important for speaker identity, I had this idea. Why don't we take the source from the person we want the voice to sound like, because it's preserved, and borrow the filter from someone about the same age and size, because they can articulate speech, and then mix them? Because when we mix them, we can get a voice that's as clear as our surrogate talker -- that's the person we borrowed the filter from -- and is similar in identity to our target talker. It's that simple. That's the science behind what we're doing.
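
The source-and-filter idea can be illustrated with a toy sketch. The talk does not describe the actual VocaliD signal processing, so everything here is an assumption made for illustration: the function names, the impulse-train source, the two-pole resonators standing in for vocal-tract formants, and all numeric values are hypothetical. It only shows the general shape of taking pitch and loudness from one talker and the filter from another.

```python
import math

SAMPLE_RATE = 16000  # samples per second (illustrative choice)


def glottal_source(f0_hz, loudness, duration_s):
    """Impulse train standing in for voice-box vibration at pitch f0_hz."""
    n = int(duration_s * SAMPLE_RATE)
    period = max(1, round(SAMPLE_RATE / f0_hz))
    return [loudness if i % period == 0 else 0.0 for i in range(n)]


def resonator(signal, formant_hz, bandwidth_hz):
    """Two-pole resonant filter standing in for one vocal-tract formant."""
    r = math.exp(-math.pi * bandwidth_hz / SAMPLE_RATE)
    theta = 2 * math.pi * formant_hz / SAMPLE_RATE
    a1, a2 = 2 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1  # shift the two-sample filter history
    return out


def mix_voice(target_prosody, surrogate_formants, duration_s=0.05):
    """Source (pitch, loudness) from the target, filter from the surrogate."""
    voiced = glottal_source(
        target_prosody["f0_hz"], target_prosody["loudness"], duration_s
    )
    for formant_hz, bandwidth_hz in surrogate_formants:
        voiced = resonator(voiced, formant_hz, bandwidth_hz)
    return voiced


# Hypothetical values: a child-like pitch for the target talker and
# two formants (centre frequency, bandwidth) for the surrogate talker.
target = {"f0_hz": 250.0, "loudness": 0.8}
surrogate = [(850.0, 80.0), (1600.0, 100.0)]
samples = mix_voice(target, surrogate)
print(len(samples))
```

Changing only `target` alters the pitch and loudness of the result, while changing only `surrogate` alters the vowel quality, which is the separation the mixing idea relies on.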

So once you have that in mind, how do you go about building this voice? Well, you have to find someone who is willing to be a surrogate. It's not such an ominous thing. Being a surrogate donor only requires you to say a few hundred to a few thousand utterances. The process goes something like this.

(Video) Voice: Things happen in pairs.

I love to sleep.

The sky is blue without clouds.

RP: Now she's going to go on like this for about three to four hours, and the idea is not for her to say everything that the target is going to want to say, but the idea is to cover all the different combinations of the sounds that occur in the language. The more speech you have, the better sounding voice you're going to have. Once you have those recordings, what we need to do is we have to parse these recordings into little snippets of speech, one- or two-sound combinations, sometimes even whole words that start populating a dataset or a database. We're going to call this database a voice bank. Now the power of the voice bank is that from this voice bank, we can now say any new utterance, like, "I love chocolate" -- everyone needs to be able to say that -- fish through that database and find all the segments necessary to say that utterance.
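
The voice-bank lookup can be sketched roughly as follows. This is a minimal illustration and not the project's actual method: the greedy longest-match rule, the function names, the clip identifiers, and the informal sound symbols are all assumptions. Real concatenative synthesizers typically choose among many candidate snippets using cost functions rather than taking the first match.

```python
def build_voice_bank(recordings):
    """Map each snippet's sound sequence (as a tuple) to its stored clip id."""
    return {tuple(sounds): clip_id for sounds, clip_id in recordings}


def select_snippets(voice_bank, sounds, max_unit=4):
    """Cover `sounds` left to right, preferring the longest stored snippet."""
    clips, i = [], 0
    while i < len(sounds):
        for size in range(min(max_unit, len(sounds) - i), 0, -1):
            unit = tuple(sounds[i:i + size])
            if unit in voice_bank:
                clips.append(voice_bank[unit])
                i += size
                break
        else:
            # No stored snippet starts with this sound.
            raise KeyError(f"no snippet covers sound {sounds[i]!r}")
    return clips


# Hypothetical snippets parsed from a surrogate's recordings.
recordings = [
    (["ai"], "clip_001"),
    (["l", "uh", "v"], "clip_002"),
    (["ch", "aw"], "clip_003"),
    (["k"], "clip_004"),
    (["l", "ih", "t"], "clip_005"),
    (["l"], "clip_006"),
]
bank = build_voice_bank(recordings)

# "I love chocolate" written as an informal sound sequence.
utterance = ["ai", "l", "uh", "v", "ch", "aw", "k", "l", "ih", "t"]
print(select_snippets(bank, utterance))
```

The selected clips would then be joined end to end to produce the new utterance, which is why broader coverage of sound combinations in the recordings gives a better sounding voice.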

(Video) Voice: I love chocolate.

RP: So that's speech synthesis. It's called concatenative synthesis, and that's what we're using. That's not the novel part. What's novel is how we make it sound like this young woman.

This is Samantha. I met her when she was nine, and since then, my team and I have been trying to build her a personalized voice. We first had to find a surrogate donor, and then we had to have Samantha produce some utterances. What she can produce are mostly vowel-like sounds, but that's enough for us to extract her source characteristics. What happens next is best described by my daughter's analogy. She's six. She calls it mixing colors to paint voices. It's beautiful. It's exactly that. Samantha's voice is like a concentrated sample of red food dye which we can infuse into the recordings of her surrogate to get a pink voice just like this.

(Video) Samantha: Aaaaaah.

RP: So now, Samantha can say this.

(Video) Samantha: This voice is only for me. I can't wait to use my new voice with my friends.

RP: Thank you. (Applause)

I'll never forget the gentle smile that spread across her face when she heard that voice for the first time. Now there's millions of people around the world like Samantha, millions, and we've only begun to scratch the surface. What we've done so far is we have a few surrogate talkers from around the U.S. who have donated their voices, and we have been using those to build our first few personalized voices. But there's so much more work to be done. For Samantha, her surrogate came from somewhere in the Midwest, a stranger who gave her the gift of voice. And as a scientist, I'm so excited to take this work out of the laboratory and finally into the real world so it can have real-world impact. What I want to share with you next is how I envision taking this work to that next level. I imagine a whole world of surrogate donors from all walks of life, different sizes, different ages, coming together in this voice drive to give people voices that are as colorful as their personalities. To do that as a first step, we've put together this website, VocaliD.org, as a way to bring together those who want to join us as voice donors, as expertise donors, in whatever way to make this vision a reality.

They say that giving blood can save lives. Well, giving your voice can change lives. All we need is a few hours of speech from our surrogate talker, and as little as a vowel from our target talker, to create a unique vocal identity.

So that's the science behind what we're doing. I want to end by circling back to the human side that is really the inspiration for this work. About five years ago, we built our very first voice for a little boy named William. When his mom first heard this voice, she said, "This is what William would have sounded like had he been able to speak." And then I saw William typing a message on his device. I wondered, what was he thinking? Imagine carrying around someone else's voice for nine years and finally finding your own voice. Imagine that.

This is what William said: "Never heard me before."

Thank you.

(Applause)

