Rupal Patel: Synthetic voices, as unique as fingerprints

أود الحديث اليوم عن جانب قوي وأساسي من هويتنا: صوتنا. لكل واحد منّا بصمة صوتية فريدة تعكس عمره، وحجمه، وحتى أسلوب حياته وشخصيته. وبكلمات الشاعر "لونجفيلو": "الصوت البشري هو العضو الذي يضم الروح." نظرًا لكوني عالمة في الكلام، يدهشني كيف يتم إصدار الصوت، ولديّ فكرة عن كيف يمكن هندسته. وهذا ما أود أن أشاركه معكم.

I'd like to talk today about a powerful and fundamental aspect of who we are: our voice. Each one of us has a unique voiceprint that reflects our age, our size, even our lifestyle and personality. In the words of the poet Longfellow, "the human voice is the organ of the soul." As a speech scientist, I'm fascinated by how the voice is produced, and I have an idea for how it can be engineered. That's what I'd like to share with you.

سأبدأ بتشغيل مقطع صوتي لصوت ربما تستطيعون تمييزه.

I'm going to start by playing you a sample of a voice that you may recognize.

(تسجيل) ستيفن هوكينغ: "كنت لأعتقد أن ما قصدته كان واضحًا بما يكفي."

(Recording) Stephen Hawking: "I would have thought it was fairly obvious what I meant."

روبال باتل: كان هذا صوت البروفسور ستيفن هوكينغ. ما لا تعرفونه أن الصوت نفسه يمكن أن تستخدمه هذه الفتاة الصغيرة التي لا تستطيع الكلام بسبب حالة عصبية. في الحقيقة، كل هؤلاء الناس قد يستخدمون نفس الصوت، وهذا يرجع إلى أن الاختيارات المتاحة قليلة للغاية. في الولايات المتحدة الأمريكية وحدها، يوجد 2 مليون و 500 ألف أمريكي غير قادرين على الكلام، والعديد منهم يستخدمون أجهزة حاسوب للتواصل. الآن، ملايين من البشر حول العالم يستخدمون أصواتًا عامة، ومن ضمنهم البروفسور هوكينغ، الذي يستخدم صوتًا ذا لهجة أمريكية. غياب تخصيص الصوت الاصطناعي أثّر بي كثيرًا حين كنت في مؤتمر للتقنيات المساعدة منذ عدة أعوام، وأتذكر أنّي دخلت إلى قاعة للعرض ورأيت فتاة صغيرة مع رجل بالغ، يتبادلان الحديث باستخدام أجهزتهما، جهازين مختلفين، لكنهما بنفس الصوت. نظرت حولي ورأيت هذا يحدث في كل مكان من حولي، حرفيًا، مئات من الأشخاص يستخدمون أصواتًا معدودة، أصوات لا تلائم أجسامهم ولا شخصياتهم. لن نحلم أبدًا بملائمة فتاة صغيرة مع طرف اصطناعي لرجل بالغ. فلماذا إذن نفعل ذلك مع الصوت الاصطناعي؟ لقد صدمني ذلك، ورغبت في أن أفعل شيئًا حياله.

Rupal Patel: That was the voice of Professor Stephen Hawking. What you may not know is that same voice may also be used by this little girl who is unable to speak because of a neurological condition. In fact, all of these individuals may be using the same voice, and that's because there's only a few options available. In the U.S. alone, there are 2.5 million Americans who are unable to speak, and many of whom use computerized devices to communicate. Now that's millions of people worldwide who are using generic voices, including Professor Hawking, who uses an American-accented voice. This lack of individuation of the synthetic voice really hit home when I was at an assistive technology conference a few years ago, and I recall walking into an exhibit hall and seeing a little girl and a grown man having a conversation using their devices, different devices, but the same voice. And I looked around and I saw this happening all around me, literally hundreds of individuals using a handful of voices, voices that didn't fit their bodies or their personalities. We wouldn't dream of fitting a little girl with the prosthetic limb of a grown man. So why then the same prosthetic voice? It really struck me, and I wanted to do something about this.

سأقوم الآن بتشغيل مقطع صوتي لشخص، أو شخصين في الواقع يعانيان من اضطرابات شديدة في الكلام. وأود منكم أن تحاولوا الإنصات للصوت الذي يصدرانه. إنهما يقولان نفس المعنى.

I'm going to play you now a sample of someone who has, two people actually, who have severe speech disorders. I want you to take a listen to how they sound. They're saying the same utterance.

(الصوت الأول)

(First voice)

(الصوت الثاني) ربما لم تستطيعوا فهم ما قالاه، لكني أتمنى أن تكونوا قد سمعتم هويتهما الصوتية الفريدة.

(Second voice) You probably didn't understand what they said, but I hope that you heard their unique vocal identities.

ما أردت فعله بعد ذلك هو، أني أردت التوصل إلى كيفية تسخير تلك القدرات الصوتية المتبقية وبناء تقنية يمكن ملائمتها لتناسبهم، أصوات يمكن ملائمتها لتناسبهم. فذهبت إلى شريكي، تيم بونل. دكتور بونل خبير في تركيب الكلام، وما كان يفعله هو بناء أصوات مخصصة للناس بتجميع مقاطع تم تسجيلها مسبقًا لأصواتهم وإعادة بناء صوت من أجلهم. هؤلاء هم الذين فقدوا أصواتهم بعد ذلك خلال حياتهم. لم تكن لدينا رفاهية الحصول على مقاطع مسجلة لهؤلاء الذين وُلدوا باضطراب في الكلام. لكني فكّرت، يجب أن تكون ثمة طريقة لهندسة صوت عكسيًا من البقايا القليلة الموجودة.

So what I wanted to do next is, I wanted to find out how we could harness these residual vocal abilities and build a technology that could be customized for them, voices that could be customized for them. So I reached out to my collaborator, Tim Bunnell. Dr. Bunnell is an expert in speech synthesis, and what he'd been doing is building personalized voices for people by putting together pre-recorded samples of their voice and reconstructing a voice for them. These are people who had lost their voice later in life. We didn't have the luxury of pre-recorded samples of speech for those born with speech disorder. But I thought, there had to be a way to reverse engineer a voice from whatever little is left over.

فقررنا أن نفعل ذلك تمامًا. بدأنا بتمويل محدود من المؤسسة الوطنية للعلوم، لإنشاء أصوات معدّة خصيصًا لتلتقط هوياتهم الصوتية الفريدة. نحن ندعو هذا المشروع فوكاليدي، أو فوكال آي دي، للهوية الصوتية.

So we decided to do exactly that. We set out with a little bit of funding from the National Science Foundation, to create custom-crafted voices that captured their unique vocal identities. We call this project VocaliD, or vocal I.D., for vocal identity.

الآن قبل الدخول في تفاصيل كيفية صناعة الصوت وجعلكم تستمعون إليه، أنا بحاجة إلى أن أعطيكم درسًا سريعًا جدًا في علم الكلام. موافقون؟ أولاً، نحن نعلم أن الصوت يتغير بشكل هائل خلال تطورنا. يختلف صوت الأطفال عن المراهقين الذين بدورهم لهم أصوات تختلف عن البالغين. كلنا نعرف ذلك جيدًا. الحقيقة الثانية هي أن الكلام هو مزيج بين المصدر، وهو الاهتزازات التي يتم إنشاؤها بواسطة حنجرتك، والتي يتم دفعها بعد ذلك عبر بقية القناة الصوتية. هذه هي الحجرات الموجودة في رأسك وعنقك التي تهتز، وتقوم فعلاً بترشيح صوت المصدر هذا لتصدر الحروف الساكنة وحروف العلة. إذن فالجمع بين المصدر والمُرشّح هو الطريقة التي نصدر بها الكلام. ويحدث ذلك في فرد واحد.

Now before I get into the details of how the voice is made and let you listen to it, I need to give you a real quick speech science lesson. Okay? So first, we know that the voice is changing dramatically over the course of development. Children sound different from teens who sound different from adults. We've all experienced this. Fact number two is that speech is a combination of the source, which is the vibrations generated by your voice box, which are then pushed through the rest of the vocal tract. These are the chambers of your head and neck that vibrate, and they actually filter that source sound to produce consonants and vowels. So the combination of source and filter is how we produce speech. And that happens in one individual.
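As a rough illustration of the source-filter idea described above, here is a minimal Python sketch. It is not the speaker's implementation: the function names, the impulse-train source, the two-pole resonator, and the resonance values are all assumptions chosen to keep the example small.

```python
import math

def glottal_source(pitch_hz, sample_rate, n_samples):
    """Impulse train with one pulse per pitch period (a crude stand-in for voice-box vibration)."""
    period = round(sample_rate / pitch_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator that emphasizes energy near freq_hz, like one vocal-tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    a1 = 2 * r * math.cos(2 * math.pi * freq_hz / sample_rate)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(pitch_hz, resonances, sample_rate=16000, n_samples=1600):
    """Push the source through each resonance in turn: source first, then filter."""
    signal = glottal_source(pitch_hz, sample_rate, n_samples)
    for freq_hz, bandwidth_hz in resonances:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal

# Illustrative values only: a 200 Hz source and two made-up resonances.
vowel = synthesize_vowel(200, [(700, 80), (1200, 90)])
```

Changing `pitch_hz` alters only the source, while changing the resonance list alters only the filter, which is the separation the talk relies on.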

لقد أخبرتكم في وقت سابق أني قد أمضيت جزءًا كبيرًا من حياتي المهنية في فهم ودراسة الخصائص التي تميز مصدر الصوت للذين يعانون من اضطراب شديد في الكلام، وما وجدته أنه على الرغم من أن تلك المرشحات كانت معطوبة، فإنهم كانوا قادرين على ضبط مصدرها: درجة الصوت، وارتفاعه، وإيقاعه. وتسمى هذه "العروض"، ولقد كنت لسنوات أقوم بتوثيق أن القدرات "العروضية" لهؤلاء الأفراد محفوظة. وعندما أدركت أن تلك الأمور نفسها مهمة كذلك لهوية المتكلم. تكونت لديّ هذه الفكرة. لماذا لا نأخذ المصدر من الشخص الذي نريد أن يكون الصوت مماثلاً له، لأنه يتم حفظه، ونقترض المُرشح من شخص ما بنفس السن والحجم تقريبًا، لأن بإمكانه التعبير بالكلام، ومن ثم نمزجهما معًا؟ لأنه عندما نمزج بينهما، يمكننا الحصول على صوت واضح، مثل ذلك "المتكلم البديل" — وهو الشخص الذي اقترضنا المُرشح منه — ويشبه في هويته الشخص الذي نريد أن نصنع صوتًا له. الأمر بسيط إلى هذه الدرجة. وهذا هو الجانب العلمي وراء ما نقوم به.

Now I told you earlier that I'd spent a good part of my career understanding and studying the source characteristics of people with severe speech disorder, and what I've found is that even though their filters were impaired, they were able to modulate their source: the pitch, the loudness, the tempo of their voice. These are called prosody, and I've been documenting for years that the prosodic abilities of these individuals are preserved. So when I realized that those same cues are also important for speaker identity, I had this idea. Why don't we take the source from the person we want the voice to sound like, because it's preserved, and borrow the filter from someone about the same age and size, because they can articulate speech, and then mix them? Because when we mix them, we can get a voice that's as clear as our surrogate talker -- that's the person we borrowed the filter from -- and is similar in identity to our target talker. It's that simple. That's the science behind what we're doing.
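The mixing step can be pictured with a small data-structure sketch in Python. Everything here is hypothetical: the `Source` and `Talker` types, the field names, and the numbers are assumptions for illustration, and real voice blending works on audio signals rather than a handful of summary values.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Source:
    """Prosodic (source) characteristics: pitch, loudness, tempo."""
    pitch_hz: float
    loudness_db: float
    tempo_syllables_per_sec: float

@dataclass(frozen=True)
class Talker:
    name: str
    source: Source
    # Filter: vocal-tract resonances; None when articulation is too impaired to use.
    filter_resonances_hz: Optional[Tuple[float, ...]]

def blend_voice(target: Talker, surrogate: Talker) -> Talker:
    """Keep the target's source, borrow the surrogate's filter."""
    if surrogate.filter_resonances_hz is None:
        raise ValueError("the surrogate must be able to articulate speech")
    return Talker(
        name=f"{target.name} (blended)",
        source=target.source,
        filter_resonances_hz=surrogate.filter_resonances_hz,
    )

# Made-up example values for a target talker and a surrogate of similar age and size.
target = Talker("target", Source(260.0, 62.0, 3.1), None)
surrogate = Talker("surrogate", Source(245.0, 65.0, 4.0), (850.0, 1600.0))
blended = blend_voice(target, surrogate)
```

The blended talker carries the target's pitch, loudness, and tempo together with the surrogate's resonances, mirroring the "take the source, borrow the filter" description.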

وعندما تكون لديك تلك الفكرة، كيف يمكنك أن تسعى إلى بناء هذا الصوت؟ حسنًا، يجب عليك البحث عن شخص ما يرغب في أن يكون "متكلمًا بديلاً". إنه ليس شيئًا خطيرًا أبدًا. أن تكون متكلمًا بديلاً. يتطلب منك الأمر فقط أن تقول بضع مئات إلى بضعة آلاف من الكلمات. يسير الأمر على هذا النحو تقريبًا.

So once you have that in mind, how do you go about building this voice? Well, you have to find someone who is willing to be a surrogate. It's not such an ominous thing. Being a surrogate donor only requires you to say a few hundred to a few thousand utterances. The process goes something like this.

(فيديو) صوت: تحدث الأمور في أزواج.

(Video) Voice: Things happen in pairs.

أحب النوم.

I love to sleep.

السماء زرقاء بلا غيوم.

The sky is blue without clouds.

روبال باتل: الآن سوف تستمر على هذا المنوال لحوالي ثلاث إلى أربع ساعات، والفكرة ليست في أن تقول كل شيء سيريد الشخص المستهدف أن يقوله، لكن الفكرة هي تغطية جميع التركيبات المختلفة من الأصوات التي تحدث في اللغة. كلما كان لديك كلام أكثر، صار الصوت الذي ستحصل عليه أفضل. وبمجرد الانتهاء من تلك التسجيلات، ما نحتاج إلى القيام به هو أن نقوم بتجزيء تلك التسجيلات إلى قصاصات صغيرة من الكلام، تركيبات صوتية مفردة أو مزدوجة، وأحيانًا كلمات بأكملها تبدأ في ملء مجموعة بيانات أو قاعدة بيانات. سنطلق على قاعدة البيانات تلك "البنك الصوتي". قوة هذا البنك الصوتي أن من هذا البنك الصوتي، يمكننا أن نقول أي كلام جديد، مثل، "أنا أحب الشوكولاته" -- الجميع بحاجة إلى أن يكون قادرًا على قول ذلك -- تجوّل في قاعدة البيانات تلك وستجد جميع الأجزاء اللازمة لقول هذا الكلام.

RP: Now she's going to go on like this for about three to four hours, and the idea is not for her to say everything that the target is going to want to say, but the idea is to cover all the different combinations of the sounds that occur in the language. The more speech you have, the better sounding voice you're going to have. Once you have those recordings, what we need to do is we have to parse these recordings into little snippets of speech, one- or two-sound combinations, sometimes even whole words that start populating a dataset or a database. We're going to call this database a voice bank. Now the power of the voice bank is that from this voice bank, we can now say any new utterance, like, "I love chocolate" -- everyone needs to be able to say that -- fish through that database and find all the segments necessary to say that utterance.
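The voice bank lookup can be sketched as a toy unit-selection routine in Python. This is only an assumption about how such a lookup might be organized: the function names, the greedy longest-match strategy, and the simplified sound spellings are invented for illustration, and a real system would store audio snippets and weigh how well neighbouring snippets join.

```python
def build_voice_bank(utterances, max_unit=2):
    """Index every one- and two-sound snippet by where it was first recorded."""
    bank = {}
    for utt_id, sounds in enumerate(utterances):
        for size in range(1, max_unit + 1):
            for start in range(len(sounds) - size + 1):
                unit = tuple(sounds[start:start + size])
                bank.setdefault(unit, (utt_id, start))  # keep the first recording of each unit
    return bank

def select_units(bank, target_sounds, max_unit=2):
    """Greedily cover a new utterance, preferring longer recorded snippets."""
    plan = []
    i = 0
    while i < len(target_sounds):
        for size in range(max_unit, 0, -1):
            unit = tuple(target_sounds[i:i + size])
            if len(unit) == size and unit in bank:
                plan.append(unit)
                i += size
                break
        else:
            raise KeyError(f"no recorded snippet covers {target_sounds[i]!r}")
    return plan

# Toy recordings with rough, illustrative sound spellings.
recordings = [
    ["AY", "L", "AH", "V", "T", "UW", "S", "L", "IY", "P"],  # "I love to sleep"
    ["CH", "AO", "K"],                                       # hypothetical extra recording
    ["L", "AH", "T"],                                        # hypothetical extra recording
]
bank = build_voice_bank(recordings)

# A new utterance that was never recorded as a whole: "I love chocolate".
plan = select_units(bank, ["AY", "L", "AH", "V", "CH", "AO", "K", "L", "AH", "T"])
```

Each entry in `plan` names a snippet that exists somewhere in the bank; concatenating the matching audio in order would produce the new utterance.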

(فيديو) صوت: أنا أحب الشوكولاته.

(Video) Voice: I love chocolate.

روبال باتل: هذا هو تركيب الكلام. إنه يسمى التوليف التراتبي، وهذا ما نستخدمه. ليس هذا هو الجزء المستحدث. المستحدث هو كيف نجعله يبدو كصوت هذه المرأة الشابة.

RP: So that's speech synthesis. It's called concatenative synthesis, and that's what we're using. That's not the novel part. What's novel is how we make it sound like this young woman.

هذه سامانثا. التقيتها عندما كانت في التاسعة من عمرها، ومنذ ذلك الحين، فريقي وأنا سعينا إلى بناء صوت مخصص لها. كان علينا أولاً العثور على متبرع بديل، ومن ثم كان علينا أن نجعل سامانثا تقول بعض التعبيرات. ما يمكنها أن تنتجه غالبًا هي أصوات شبيهة بحروف العلة، ولكن هذا يكفينا لاستخراج خصائص مصدرها الصوتي. ما سيحدث بعد ذلك أفضل وصف له هو تعبير ابنتي. هي في السادسة من عمرها. إنها تدعوه "مزج الألوان لتلوين الأصوات". هذا جميل. الأمر كذلك بالضبط. صوت سامانثا يشبه عينة مركزة من صبغ الطعام الأحمر الذي يمكننا أن نبثه في تسجيلات بديلها للحصول على صوت وردي كهذا.

This is Samantha. I met her when she was nine, and since then, my team and I have been trying to build her a personalized voice. We first had to find a surrogate donor, and then we had to have Samantha produce some utterances. What she can produce are mostly vowel-like sounds, but that's enough for us to extract her source characteristics. What happens next is best described by my daughter's analogy. She's six. She calls it mixing colors to paint voices. It's beautiful. It's exactly that. Samantha's voice is like a concentrated sample of red food dye which we can infuse into the recordings of her surrogate to get a pink voice just like this.

(فيديو) سامانثا: آااااه.

(Video) Samantha: Aaaaaah.

روبال باتل: حتى الآن، باستطاعة سامانثا قول هذا.

RP: So now, Samantha can say this.

(فيديو) سامانثا: هذا الصوت لي أنا فقط. لا أستطيع الانتظار لاستخدام صوتي الجديد مع أصدقائي.

(Video) Samantha: This voice is only for me. I can't wait to use my new voice with my friends.

روبال باتل: شكرًا لكم. (تصفيق)

RP: Thank you. (Applause)

لن أنسى أبدًا الابتسامة الرقيقة التي انطبعت على وجهها عندما سمعت ذلك الصوت للمرة الأولى. الآن هناك ملايين البشر حول العالم مثل سامانثا، الملايين، ولقد بدأنا فقط بخدش السطح. ما فعلناه حتى الآن أننا نملك القليل من المتكلمين البديلين من جميع أنحاء الولايات المتحدة الأمريكية الذين تبرعوا بأصواتهم، ولقد تم استخدام تلك لبناء الأصوات الأولى التي قمنا بتخصيصها. ولكن هناك أكثر من ذلك بكثير ينبغي القيام به. في حالة سامانثا، جاءت بديلتها من مكان ما في وسط الغرب، شخص غريب قدم لها هدية الصوت. و كعالمة، أنا متحمسة كثيرًا لإخراج هذا العمل من المختبر إلى العالم الحقيقي أخيرًا حتى يكون بإمكاننا التأثير في العالم الحقيقي. ما أريد أن أشارككم به بعد ذلك هو كيف أتصور أخذ هذا العمل إلى هذا المستوى التالي. أتصور عالمًا كاملاً من المانحين البديلين من جميع مناحي الحياة، بأحجام مختلفة، وأعمار مختلفة، يشتركون معًا في محرك الأصوات هذا ليعطوا غيرهم أصوات ملونة كما هي شخصياتهم. للقيام بذلك كخطوة أولى، قمنا بإنشاء هذا الموقع، VocaliD.org، كطريقة للجمع بين الذين يرغبون في الانضمام إلينا كمتبرعين بالأصوات، كمتبرعين بخبراتهم، بأي شكل يجعل من هذه الرؤية حقيقة.

I'll never forget the gentle smile that spread across her face when she heard that voice for the first time. Now there's millions of people around the world like Samantha, millions, and we've only begun to scratch the surface. What we've done so far is we have a few surrogate talkers from around the U.S. who have donated their voices, and we have been using those to build our first few personalized voices. But there's so much more work to be done. For Samantha, her surrogate came from somewhere in the Midwest, a stranger who gave her the gift of voice. And as a scientist, I'm so excited to take this work out of the laboratory and finally into the real world so it can have real-world impact. What I want to share with you next is how I envision taking this work to that next level. I imagine a whole world of surrogate donors from all walks of life, different sizes, different ages, coming together in this voice drive to give people voices that are as colorful as their personalities. To do that as a first step, we've put together this website, VocaliD.org, as a way to bring together those who want to join us as voice donors, as expertise donors, in whatever way to make this vision a reality.

يقولون أن التبرع بالدم يمكن أن ينقذ حياة. حسنًا، إن التبرع بصوتك يمكنه تغيير حياة. كل ما نحتاجه بضع ساعات من الكلام من المتكلم البديل، والقليل من المتكلم المستهدف، حتى ولو حرف علة، لإنشاء هوية صوتية فريدة.

They say that giving blood can save lives. Well, giving your voice can change lives. All we need is a few hours of speech from our surrogate talker, and as little as a vowel from our target talker, to create a unique vocal identity.

هذا هو الجانب العلمي وراء ما نقوم به. أريد أن أنهي حديثي بالعودة إلى الجانب الإنساني، فإنه حقًا مصدر الإلهام لهذا العمل. منذ حوالي خمس سنوات بنينا صوتنا الأول لصبي صغير اسمه ويليام. حين سمعت والدته هذا الصوت للمرة الأولى، قالت: "هذا هو الصوت الذي كان ليصدره ويليام لو كان قادرًا على الكلام." ثم رأيت ويليام يكتب رسالة على جهازه. تساءلت، بمَ كان يفكر؟ تخيّلوا أن تحملوا صوت شخص آخر لتسعة أعوام وفي النهاية تعثرون على صوتكم الخاص. تخيّلوا ذلك.

So that's the science behind what we're doing. I want to end by circling back to the human side that is really the inspiration for this work. About five years ago, we built our very first voice for a little boy named William. When his mom first heard this voice, she said, "This is what William would have sounded like had he been able to speak." And then I saw William typing a message on his device. I wondered, what was he thinking? Imagine carrying around someone else's voice for nine years and finally finding your own voice. Imagine that.

هذا هو ما قاله ويليام: "لم أسمع نفسي قط قبل ذلك."

This is what William said: "Never heard me before."

شكرًا لكم.

Thank you.

(تصفيق)

(Applause)



