Mainak Mazumdar: How bad data keeps us from good AI

AI could add 16 trillion dollars to the global economy in the next 10 years. This economy is not going to be built by billions of people or millions of factories, but by computers and algorithms. We have already seen amazing benefits of AI in simplifying tasks, bringing efficiencies and improving our lives. However, when it comes to fair and equitable policy decision-making, AI has not lived up to its promise. AI is becoming a gatekeeper to the economy, deciding who gets a job and who gets access to a loan. AI is only reinforcing and accelerating our bias at speed and scale, with societal implications. So, is AI failing us? Are we designing these algorithms to deliver biased and wrong decisions?

As a data scientist, I'm here to tell you, it's not the algorithm, but the biased data that's responsible for these decisions. To make AI possible for humanity and society, we need an urgent reset. Instead of algorithms, we need to focus on the data. We're spending time and money to scale AI at the expense of designing and collecting high-quality and contextual data. We need to stop relying on the biased data that we already have, and focus on three things: data infrastructure, data quality and data literacy.

In June of this year, we saw embarrassing bias in the Duke University AI model called PULSE, which enhanced a blurry image into a recognizable photograph of a person. This algorithm incorrectly enhanced an image of a nonwhite person into an image of a Caucasian person. African-American images were underrepresented in the training set, leading to wrong decisions and predictions. This is probably not the first time you have seen an AI misidentify a Black person's image. Despite an improved AI methodology, the underrepresentation of racial and ethnic populations still left us with biased results.

This research is academic; however, not all data biases are academic. Biases have real consequences.

Take the 2020 US Census. The census is the foundation for many social and economic policy decisions; therefore, the census is required to count 100 percent of the population in the United States. However, with the pandemic and the politics of the citizenship question, undercounting of minorities is a real possibility. I expect significant undercounting of minority groups who are hard to locate, contact, persuade and interview for the census. Undercounting will introduce bias and erode the quality of our data infrastructure.

Let's look at undercounts in the 2010 census. 16 million people were omitted in the final counts. This is as large as the total population of Arizona, Arkansas, Oklahoma and Iowa put together for that year. We have also seen about a million kids under the age of five undercounted in the 2010 Census.

Now, undercounting of minorities is common in other national censuses, as minorities can be harder to reach, may mistrust the government, or may live in an area under political unrest.

For example, the Australian Census in 2016 undercounted Aboriginal and Torres Strait Islander populations by about 17.5 percent. We estimate undercounting in the 2020 US Census to be much higher than in 2010, and the implications of this bias can be massive.

Let's look at the implications of the census data. The census is the most trusted, open and publicly available rich data on population composition and characteristics. While businesses have proprietary information on consumers, the Census Bureau reports definitive, public counts on age, gender, ethnicity, race, employment, family status, as well as geographic distribution, which are the foundation of the population data infrastructure. When minorities are undercounted, AI models supporting public transportation, housing, health care and insurance are likely to overlook the communities that require these services the most.

The first step to improving results is to make the database representative of age, gender, ethnicity and race per census data. Since the census is so important, we have to make every effort to count 100 percent. Investing in this data quality and accuracy is essential to making AI possible, not only for the privileged few, but for everyone in society.
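
One common way to make an existing data set representative against census figures is post-stratification weighting. What follows is only a minimal sketch of that idea, not the method used in any system mentioned in this talk. The function name `poststratification_weights`, the group labels and the population shares are all hypothetical and chosen for illustration.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Return one weight per record so that the weighted share of each
    group in the sample matches its share in the reference population."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    weights = []
    for group in sample_groups:
        sample_share = counts[group] / n
        # Under-represented groups get weights above 1, over-represented below 1.
        weights.append(population_shares[group] / sample_share)
    return weights


# Hypothetical data set: group "B" is 30% of the population
# but only 10% of the collected records.
sample = ["A"] * 90 + ["B"] * 10
census_shares = {"A": 0.70, "B": 0.30}  # illustrative, not real census figures

weights = poststratification_weights(sample, census_shares)
weighted_share_b = (
    sum(w for g, w in zip(sample, weights) if g == "B") / sum(weights)
)
print(f"Weighted share of group B: {weighted_share_b:.2f}")
```

Two limits of this sketch are worth keeping in mind: reweighting cannot help a group that has no records at all, and the weights are only as good as the census shares they target, which is why an undercount in the census itself carries through to every model built on it.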

Most AI systems use the data that's already available or collected for some other purposes because it's convenient and cheap. Yet data quality is a discipline that requires commitment -- real commitment. This attention to the definition, data collection and measurement of the bias is not only underappreciated -- in the world of speed, scale and convenience, it's often ignored.

As part of the Nielsen data science team, I went on field visits to collect data, visiting retail stores outside Shanghai and Bangalore. The goal of those visits was to measure retail sales from those stores. We drove miles outside the city, found these small stores -- informal, hard to reach. And you may be wondering -- why are we interested in these specific stores? We could have selected a store in the city where the electronic data could be easily integrated into a data pipeline -- cheap, convenient and easy. Why are we so obsessed with the quality and accuracy of the data from these stores? The answer is simple: because the data from these rural stores matter. According to the International Labour Organization, 40 percent of Chinese and 65 percent of Indians live in rural areas. Imagine the bias in decisions when 65 percent of consumption in India is excluded from models, meaning the decisions will favor the urban over the rural.
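
The size of this kind of coverage bias can be sketched with a small calculation. The 65 percent rural share is the figure quoted above; the spending numbers are invented purely for illustration and do not come from Nielsen or any other source, and the helper `weighted_mean` is a hypothetical name.

```python
def weighted_mean(values, shares):
    """Mean across the segments present in `values`, weighting each
    segment by its population share (renormalized over those segments)."""
    total_share = sum(shares[k] for k in values)
    return sum(values[k] * shares[k] for k in values) / total_share


# Population shares: 65% rural is the figure quoted in the talk for India.
shares = {"rural": 0.65, "urban": 0.35}

# Hypothetical average spend per household in each segment (made-up units).
avg_spend = {"rural": 40.0, "urban": 100.0}

# Estimate using every segment versus an estimate built from urban stores only.
true_mean = weighted_mean(avg_spend, shares)
urban_only_mean = weighted_mean({"urban": avg_spend["urban"]}, shares)
bias = urban_only_mean - true_mean

print(f"All stores:  {true_mean:.1f}")
print(f"Urban only:  {urban_only_mean:.1f}")
print(f"Overstated by: {bias:.1f}")
```

Under these assumed numbers, a model fed only urban store data would overstate average spending by a wide margin, and no choice of algorithm downstream could recover the missing rural signal.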

Without this rural-urban context and signals on livelihood, lifestyle, economy and values, retail brands will make wrong investments in pricing, advertising and marketing. Or the urban bias will lead to wrong rural policy decisions with regard to health and other investments. Wrong decisions are not a problem with the AI algorithm. They are a problem with data that excludes the areas it was intended to measure in the first place. Data in context is the priority, not the algorithms.

Let's look at another example. I visited remote trailer park homes in Oregon state and New York City apartments to invite these homes to participate in Nielsen panels. Panels are statistically representative samples of homes that we invite to participate in the measurement over a period of time. Our mission to include everybody in the measurement led us to collect data from Hispanic and African-American homes that receive over-the-air TV through an antenna. Per Nielsen data, these homes constitute 15 percent of US households, which is about 45 million people. Commitment and focus on quality means we made every effort to collect information from these 15 percent, hard-to-reach groups.

Why does it matter? This is a sizeable group that's very, very important to the marketers, brands, as well as the media companies. Without the data, the marketers and brands and their models would not be able to reach these folks or show ads to these very, very important minority populations. And without the ad revenue, broadcasters such as Telemundo or Univision would not be able to deliver free content, including news media, which is so foundational to our democracy.

This data is essential for businesses and society. Our once-in-a-lifetime opportunity to reduce human bias in AI starts with the data. Instead of racing to build new algorithms, my mission is to build a better data infrastructure that makes ethical AI possible. I hope you will join me in my mission as well.

Thank you.
