Mainak Mazumdar: How bad data keeps us from good AI

AI could add 16 trillion dollars to the global economy in the next 10 years. This economy is not going to be built by billions of people or millions of factories, but by computers and algorithms. We have already seen amazing benefits of AI in simplifying tasks, bringing efficiencies and improving our lives. However, when it comes to fair and equitable policy decision-making, AI has not lived up to its promise. AI is becoming a gatekeeper to the economy, deciding who gets a job and who gets access to a loan. AI is only reinforcing and accelerating our bias at speed and scale, with societal implications. So, is AI failing us? Are we designing these algorithms to deliver biased and wrong decisions?

As a data scientist, I'm here to tell you, it's not the algorithm, but the biased data that's responsible for these decisions. To make AI possible for humanity and society, we need an urgent reset. Instead of algorithms, we need to focus on the data. We're spending time and money to scale AI at the expense of designing and collecting high-quality and contextual data. We need to stop the data, or the biased data that we already have, and focus on three things: data infrastructure, data quality and data literacy.

In June of this year, we saw embarrassing bias in the Duke University AI model called PULSE, which enhanced a blurry image into a recognizable photograph of a person. This algorithm incorrectly enhanced a nonwhite image into a Caucasian image. African-American images were underrepresented in the training set, leading to wrong decisions and predictions. Probably this is not the first time you have seen an AI misidentify a Black person's image. Despite an improved AI methodology, the underrepresentation of racial and ethnic populations still left us with biased results.

This research is academic; however, not all data biases are academic. Biases have real consequences.

Take the 2020 US Census. The census is the foundation for many social and economic policy decisions; therefore, the census is required to count 100 percent of the population in the United States. However, with the pandemic and the politics of the citizenship question, undercounting of minorities is a real possibility. I expect significant undercounting of minority groups who are hard to locate, contact, persuade and interview for the census. Undercounting will introduce bias and erode the quality of our data infrastructure.

Let's look at undercounts in the 2010 census. 16 million people were omitted in the final counts. This is as large as the total population of Arizona, Arkansas, Oklahoma and Iowa put together for that year. We have also seen about a million kids under the age of five undercounted in the 2010 Census.

Now, undercounting of minorities is common in other national censuses, as minorities can be harder to reach, may be mistrustful of the government or may live in an area under political unrest.

For example, the Australian Census in 2016 undercounted Aboriginal and Torres Strait Islander populations by about 17.5 percent. We estimate undercounting in 2020 to be much higher than in 2010, and the implications of this bias can be massive.

Let's look at the implications of the census data. The census is the most trusted, open and publicly available rich data on population composition and characteristics. While businesses have proprietary information on consumers, the Census Bureau reports definitive, public counts on age, gender, ethnicity, race, employment, family status, as well as geographic distribution, which are the foundation of the population data infrastructure. When minorities are undercounted, AI models supporting public transportation, housing, health care and insurance are likely to overlook the communities that require these services the most.

The first step to improving results is to make that database representative of age, gender, ethnicity and race per census data. Since the census is so important, we have to make every effort to count 100 percent. Investing in this data quality and accuracy is essential to making AI possible, not only for the few and privileged, but for everyone in society.
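One common way to make a dataset "representative per census data" is post-stratification: each record is weighted by the ratio of its group's census share to its share in the sample. The sketch below is only an illustration of that idea, not a method described in the talk. The function name, the group labels "A" and "B", and all the numbers are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Return a weight per group so the weighted sample matches the target shares.

    sample_groups: list of group labels, one per record in the sample.
    population_shares: dict mapping group label -> share in the reference
    population (for example, census shares). Assumed to sum to 1.
    """
    counts = Counter(sample_groups)
    n = len(sample_groups)
    weights = {}
    for group, target_share in population_shares.items():
        sample_share = counts.get(group, 0) / n
        if sample_share == 0:
            # A group that was never collected cannot be recovered by weighting.
            raise ValueError(f"group {group!r} is missing from the sample")
        weights[group] = target_share / sample_share
    return weights


# Hypothetical sample: group B is 20% of the records but 40% of the population.
sample = ["A"] * 80 + ["B"] * 20
census_shares = {"A": 0.6, "B": 0.4}

weights = poststratification_weights(sample, census_shares)

# Weighted share of each group after reweighting.
counts = Counter(sample)
weighted_share = {g: weights[g] * counts[g] / len(sample) for g in weights}
```

Two caveats follow directly from the argument here. Reweighting only works when the reference shares are themselves accurate, so an undercounted census passes its bias on to every dataset calibrated against it. And a group that is absent from the sample cannot be weighted back in, which is why the sketch raises an error in that case.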

Most AI systems use the data that's already available or collected for some other purposes because it's convenient and cheap. Yet data quality is a discipline that requires commitment -- real commitment. This attention to the definition, data collection and measurement of the bias is not only underappreciated -- in the world of speed, scale and convenience, it's often ignored.

As part of the Nielsen data science team, I went on field visits to collect data, visiting retail stores outside Shanghai and Bangalore. The goal of that visit was to measure retail sales from those stores. We drove miles outside the city, found these small stores -- informal, hard to reach. And you may be wondering -- why are we interested in these specific stores? We could have selected a store in the city where the electronic data could be easily integrated into a data pipeline -- cheap, convenient and easy. Why are we so obsessed with the quality and accuracy of the data from these stores? The answer is simple: because the data from these rural stores matter. According to the International Labour Organization, 40 percent of Chinese and 65 percent of Indians live in rural areas. Imagine the bias in decisions when 65 percent of consumption in India is excluded from models, meaning the decisions will favor the urban over the rural.
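To see how excluding the rural segment skews an estimate, here is a small worked example. Only the 65 percent rural share comes from the talk. The average spend figures are invented for illustration, and the function and variable names are hypothetical.

```python
def weighted_mean(values, shares):
    """Average of per-segment values, weighted by each segment's population share."""
    total_share = sum(shares.values())
    return sum(values[k] * shares[k] for k in shares) / total_share


# Hypothetical average spend per household in each segment (arbitrary units).
avg_spend = {"urban": 100.0, "rural": 40.0}

# Population shares: 65% rural as cited in the talk, the remainder urban.
population_share = {"urban": 0.35, "rural": 0.65}

# What a model would see if both segments were measured.
true_mean = weighted_mean(avg_spend, population_share)

# What a model sees if only the convenient urban stores are measured.
urban_only_mean = avg_spend["urban"]

# Coverage bias: how far the urban-only estimate is from the full picture.
coverage_bias = urban_only_mean - true_mean
```

With these made-up figures, the full-coverage average is 61 while the urban-only estimate is 100, so the convenient data overstates spend by 39 units. No choice of algorithm downstream can correct for that gap, because the rural signal never entered the data.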

Without this rural-urban context and signals on livelihood, lifestyle, economy and values, retail brands will make wrong investments in pricing, advertising and marketing. Or the urban bias will lead to wrong rural policy decisions with regard to health and other investments. Wrong decisions are not a problem with the AI algorithm. They are a problem of data that excludes the areas intended to be measured in the first place. The data in context is the priority, not the algorithms.

Let's look at another example. I visited remote trailer park homes in Oregon state and New York City apartments to invite these homes to participate in Nielsen panels. Panels are statistically representative samples of homes that we invite to participate in the measurement over a period of time. Our mission to include everybody in the measurement led us to collect data from these Hispanic and African homes that use over-the-air TV reception through an antenna. Per Nielsen data, these homes constitute 15 percent of US households, which is about 45 million people. Commitment and focus on quality means we made every effort to collect information from this hard-to-reach 15 percent.

Why does it matter? This is a sizeable group that's very, very important to the marketers, brands, as well as the media companies. Without the data, the marketers and brands and their models would not be able to reach these folks or show ads to these very, very important minority populations. And without the ad revenue, broadcasters such as Telemundo or Univision would not be able to deliver free content, including news media, which is so foundational to our democracy.

This data is essential for businesses and society. Our once-in-a-lifetime opportunity to reduce human bias in AI starts with the data. Instead of racing to build new algorithms, my mission is to build a better data infrastructure that makes ethical AI possible. I hope you will join me in my mission as well.

Thank you.
