AI could add 16 trillion dollars to the global economy in the next 10 years. This economy is not going to be built by billions of people or millions of factories, but by computers and algorithms. We have already seen amazing benefits of AI in simplifying tasks, bringing efficiencies and improving our lives. However, when it comes to fair and equitable policy decision-making, AI has not lived up to its promise. AI is becoming a gatekeeper to the economy, deciding who gets a job and who gets access to a loan. AI is only reinforcing and accelerating our bias at speed and scale, with societal implications. So, is AI failing us? Are we designing these algorithms to deliver biased and wrong decisions?
As a data scientist, I'm here to tell you it's not the algorithm, but the biased data, that's responsible for these decisions. To make AI possible for humanity and society, we need an urgent reset. Instead of algorithms, we need to focus on the data. We're spending time and money to scale AI at the expense of designing and collecting high-quality, contextual data. We need to stop relying on the biased data we already have and focus on three things: data infrastructure, data quality and data literacy.
In June of this year, we saw embarrassing bias in the Duke University AI model called PULSE, which enhanced a blurry image into a recognizable photograph of a person. This algorithm incorrectly enhanced a nonwhite image into a Caucasian image. African-American images were underrepresented in the training set, leading to wrong decisions and predictions. This is probably not the first time you have seen an AI misidentify a Black person's image. Despite improved AI methodology, the underrepresentation of racial and ethnic populations still left us with biased results.
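The talk doesn't show any code, but a minimal sketch can make the underlying failure concrete: auditing a training set's demographic composition against a population benchmark before training. Everything below is invented for illustration -- the labels, counts and benchmark shares are hypothetical, not real PULSE data.

```python
# Illustrative representation audit: compare each group's share of the
# training data against a population benchmark and flag severe gaps.
# All labels and numbers here are hypothetical.

from collections import Counter

# Hypothetical demographic labels attached to training images
training_labels = ["white"] * 800 + ["asian"] * 100 + ["black"] * 50 + ["hispanic"] * 50

# Illustrative population benchmark shares (e.g., from census data)
census_share = {"white": 0.60, "black": 0.13, "asian": 0.06, "hispanic": 0.19}

counts = Counter(training_labels)
total = sum(counts.values())

for group, target in census_share.items():
    observed = counts.get(group, 0) / total
    # Flag any group whose data share falls well below its population share
    if observed < 0.5 * target:
        print(f"{group}: {observed:.1%} of training data vs {target:.1%} of population -- underrepresented")
```

An audit like this is cheap compared with discovering biased results after a model is already in production.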
This research is academic; however, not all data biases are academic. Biases have real consequences.
Take the 2020 US Census. The census is the foundation for many social and economic policy decisions; therefore, the census is required to count 100 percent of the population of the United States. However, with the pandemic and the politics of the citizenship question, undercounting of minorities is a real possibility. I expect significant undercounting of minority groups who are hard to locate, contact, persuade and interview for the census. Undercounting will introduce bias and erode the quality of our data infrastructure.
Let's look at undercounts in the 2010 census. Sixteen million people were omitted from the final counts. That is as large as the total populations of Arizona, Arkansas, Oklahoma and Iowa put together for that year. We also saw about a million kids under the age of five undercounted in the 2010 census.
Now, undercounting of minorities is common in other national censuses as well, as minorities can be harder to reach, mistrustful of the government, or living in an area of political unrest.
For example, the Australian Census in 2016 undercounted the Aboriginal and Torres Strait Islander populations by about 17.5 percent. We estimate undercounting in 2020 to be much higher than in 2010, and the implications of this bias could be massive.
Let's look at the implications of the census data. The census is the most trusted, open and publicly available source of rich data on population composition and characteristics. While businesses have proprietary information on consumers, the Census Bureau reports definitive, public counts on age, gender, ethnicity, race, employment and family status, as well as geographic distribution, which are the foundation of the population data infrastructure. When minorities are undercounted, AI models supporting public transportation, housing, health care and insurance are likely to overlook the communities that require these services the most.
The first step to improving results is to make the database representative of age, gender, ethnicity and race per census data. Since the census is so important, we have to make every effort to count 100 percent. Investing in this data quality and accuracy is essential to making AI possible, not only for the few and privileged, but for everyone in society.
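One standard way to make a database representative per census data is post-stratification weighting. The talk doesn't prescribe a method, so this is only a minimal sketch under the assumption that each record carries a demographic label; the sample records and census shares below are invented.

```python
# A minimal post-stratification sketch: reweight a sample so each group's
# influence matches its census share. Data and shares are hypothetical.

import pandas as pd

sample = pd.DataFrame({
    "ethnicity": ["white", "white", "black", "hispanic", "white", "asian"],
    "income": [55, 72, 48, 51, 60, 66],  # illustrative outcome variable
})

# Hypothetical census shares for each stratum
census_share = {"white": 0.60, "black": 0.13, "hispanic": 0.19, "asian": 0.06}

# Weight = population share / sample share, so undercounted groups count more
sample_share = sample["ethnicity"].value_counts(normalize=True)
sample["weight"] = sample["ethnicity"].map(lambda g: census_share[g] / sample_share[g])

# Compare the naive estimate with the reweighted one
print("unweighted mean income:", sample["income"].mean())
print("weighted mean income:  ",
      (sample["income"] * sample["weight"]).sum() / sample["weight"].sum())
```

Reweighting can only stretch the data you have; if a group is missing from the data entirely, no weight can recover it, which is exactly why counting 100 percent matters in the first place.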
Most AI systems use data that's already available or was collected for some other purpose, because it's convenient and cheap. Yet data quality is a discipline that requires commitment -- real commitment. This attention to the definition, collection and measurement of bias is not only underappreciated -- in a world of speed, scale and convenience, it's often ignored.
As part of the Nielsen data science team, I went on field visits to collect data, visiting retail stores outside Shanghai and Bangalore. The goal of those visits was to measure retail sales from those stores. We drove miles outside the city and found these small stores -- informal, hard to reach. And you may be wondering: why are we interested in these specific stores? We could have selected a store in the city where the electronic data could be easily integrated into a data pipeline -- cheap, convenient and easy. Why are we so obsessed with the quality and accuracy of the data from these stores? The answer is simple: because the data from these rural stores matter. According to the International Labour Organization, 40 percent of Chinese and 65 percent of Indians live in rural areas. Imagine the bias in decisions when 65 percent of consumption in India is excluded from models, meaning the decisions will favor the urban over the rural.
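To make that urban skew concrete, here is a toy calculation. Only the 65 percent rural share comes from the talk; the per-capita spend figures are invented for illustration.

```python
# A toy illustration of urban bias: what a model estimates when rural
# consumption is excluded. Spend figures are hypothetical; only the
# 65 percent rural share comes from the talk.

urban_share, rural_share = 0.35, 0.65
urban_avg_spend, rural_avg_spend = 100.0, 40.0  # hypothetical per-capita spend

true_national_avg = urban_share * urban_avg_spend + rural_share * rural_avg_spend
urban_only_estimate = urban_avg_spend  # what an urban-only model sees

print(f"true national average: {true_national_avg:.0f}")  # 61
print(f"urban-only estimate:   {urban_only_estimate:.0f}")  # 100
```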
Without this rural-urban context and signals on livelihood, lifestyle, economy and values, retail brands will make wrong investments in pricing, advertising and marketing. Or the urban bias will lead to wrong rural policy decisions with regards to health and other investments. Wrong decisions are not a problem with the AI algorithm. They're a problem of data that excludes the areas it was meant to measure in the first place. Data in context is the priority, not the algorithms.
Let's look at another example. I visited remote trailer-park homes in Oregon state and New York City apartments to invite these homes to participate in Nielsen panels. Panels are statistically representative samples of homes that we invite to participate in the measurement over a period of time. Our mission to include everybody in the measurement led us to collect data from Hispanic and African-American homes who use over-the-air TV reception with an antenna. Per Nielsen data, these homes constitute 15 percent of US households, which is about 45 million people. Commitment and focus on quality means we made every effort to collect information from these 15 percent, hard-to-reach groups.
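The talk doesn't detail how such panels are drawn, but a common way to build a statistically representative panel is stratified sampling. Here is a minimal sketch; the sampling frame, the strata and the way homes are tagged are all hypothetical, and only the 15 percent over-the-air share comes from the talk.

```python
# A hedged sketch of stratified panel recruitment: draw each stratum's quota
# at random in proportion to its population share, instead of recruiting only
# the homes that are easiest to reach. The frame below is hypothetical.

import random

random.seed(7)

# Hypothetical sampling frame of (home_id, stratum) pairs
frame = [(i, "over_the_air" if i % 7 == 0 else "cable_or_streaming")
         for i in range(10_000)]

population_share = {"over_the_air": 0.15, "cable_or_streaming": 0.85}
panel_size = 200

panel = []
for stratum, share in population_share.items():
    members = [home for home in frame if home[1] == stratum]
    quota = round(panel_size * share)  # 30 over-the-air homes, 170 others
    panel.extend(random.sample(members, quota))

print({s: sum(1 for home in panel if home[1] == s) for s in population_share})
```

The design choice here is the point: hard-to-reach strata get their full quota by construction, rather than whatever a convenience sample happens to pick up.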
Why does it matter? This is a sizeable group that's very, very important to marketers, brands, as well as media companies. Without the data, the marketers and brands and their models would not be able to reach these folks or show ads to these very important minority populations. And without the ad revenue, broadcasters such as Telemundo or Univision would not be able to deliver free content, including the news media that is so foundational to our democracy.
This data is essential for businesses and society. Our once-in-a-lifetime opportunity to reduce human bias in AI starts with the data. Instead of racing to build new algorithms, my mission is to build a better data infrastructure that makes ethical AI possible. I hope you will join me in this mission as well.
Thank you.