Algorithms are everywhere. They sort and separate the winners from the losers. The winners get the job or a good credit card offer. The losers don't even get an interview, or they pay more for insurance. We're being scored with secret formulas that we don't understand, formulas that often have no system of appeal. That raises the question: What if the algorithms are wrong?
To build an algorithm you need two things: you need data, what happened in the past, and a definition of success, the thing you're looking for and often hoping for. You train an algorithm by letting it look at the past and figure out what is associated with success. What kind of situation leads to success?
Actually, everyone uses algorithms. They just don't formalize them in written code. Let me give you an example. I use an algorithm every day to make a meal for my family. The data I use is the ingredients in my kitchen, the time I have, the ambition I have, and I curate that data. I don't count those little packages of ramen noodles as food.
(Laughter)
My definition of success is: a meal is successful if my kids eat vegetables. It would be very different if my youngest son were in charge. He'd say success is if he gets to eat lots of Nutella. But I get to choose success. I am in charge. My opinion matters. That's the first rule of algorithms.
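Here's a minimal sketch of that first rule in Python: the same historical data plus a chosen definition of success. The meal records and both success definitions are invented for illustration.

```python
# Toy "training": find which meals are associated with success,
# for whatever definition of success the person in charge encodes.
# All data below is invented.

meals = [  # the data: what happened in the past
    {"main": "stir-fry", "veggies_eaten": True,  "nutella_eaten": False},
    {"main": "pancakes", "veggies_eaten": False, "nutella_eaten": True},
    {"main": "pasta",    "veggies_eaten": True,  "nutella_eaten": False},
    {"main": "ramen",    "veggies_eaten": False, "nutella_eaten": False},
]

def train(history, is_success):
    """Return each meal's success rate under the chosen definition."""
    tallies = {}
    for meal in history:
        wins, total = tallies.get(meal["main"], (0, 0))
        tallies[meal["main"]] = (wins + int(is_success(meal)), total + 1)
    return {main: wins / total for main, (wins, total) in tallies.items()}

# Mom's opinion, embedded in code: success means the kids ate vegetables.
print(train(meals, lambda m: m["veggies_eaten"]))
# The youngest son's opinion: success means Nutella.
print(train(meals, lambda m: m["nutella_eaten"]))
# Same data, different opinions -- two different "objective" algorithms.
```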
Algorithms are opinions embedded in code. That's very different from what most people think of algorithms. They think algorithms are objective and true and scientific. That's a marketing trick. It's also a marketing trick to intimidate you with algorithms, to make you trust and fear algorithms because you trust and fear mathematics. A lot can go wrong when we put blind faith in big data.
This is Kiri Soares. She's a high school principal in Brooklyn. In 2011, she told me her teachers were being scored with a complex, secret algorithm called the "value-added model." I told her, "Well, figure out what the formula is, show it to me. I'm going to explain it to you." She said, "Well, I tried to get the formula, but my Department of Education contact told me it was math and I wouldn't understand it."
It gets worse. The New York Post filed a Freedom of Information Act request, got all the teachers' names and all their scores and they published them as an act of teacher-shaming. When I tried to get the formulas, the source code, through the same means, I was told I couldn't. I was denied. I later found out that nobody in New York City had access to that formula. No one understood it. Then someone really smart got involved, Gary Rubinstein. He found 665 teachers from that New York Post data that actually had two scores. That could happen if they were teaching seventh grade math and eighth grade math. He decided to plot them. Each dot represents a teacher.
(Laughter)
What is that?
(Laughter)
That should never have been used for individual assessment. It's almost a random number generator.
(Applause)
But it was. This is Sarah Wysocki. She got fired, along with 205 other teachers, from the Washington, DC school district, even though she had great recommendations from her principal and the parents of her kids.
I know what a lot of you guys are thinking, especially the data scientists, the AI experts here. You're thinking, "Well, I would never make an algorithm that inconsistent." But algorithms can go wrong, even have deeply destructive effects, despite good intentions. And whereas an airplane that's designed badly crashes to the earth and everyone sees it, an algorithm designed badly can go on for a long time, silently wreaking havoc.
This is Roger Ailes.
(Laughter)
He founded Fox News in 1996. More than 20 women complained about sexual harassment. They said they weren't allowed to succeed at Fox News. He was ousted last year, but we've seen recently that the problems have persisted. That raises the question: What should Fox News do to turn over another leaf?
Well, what if they replaced their hiring process with a machine-learning algorithm? That sounds good, right? Think about it. The data -- what would the data be? A reasonable choice would be the last 21 years of applications to Fox News. Reasonable. What about the definition of success? A reasonable choice would be, well, who is successful at Fox News? I guess someone who, say, stayed there for four years and was promoted at least once. Sounds reasonable. And then the algorithm would be trained. It would be trained on those people to learn what led to success -- what kind of applications historically led to success by that definition. Now think about what would happen if we applied that to a current pool of applicants. It would filter out women, because they do not look like the people who were successful in the past.
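Here's a sketch of how that would play out, with invented applicant records: the bias is baked into the historical outcomes, and a naive model faithfully learns it.

```python
# Sketch: a naive screening model trained on biased historical outcomes.
# All records and rates below are invented; the mechanism is the point.
import random

random.seed(0)

# The "data": past applicants, where success means stayed four years
# and was promoted at least once. In this invented past, equally
# qualified women were rarely allowed to succeed.
history = []
for _ in range(1000):
    gender = random.choice(["woman", "man"])
    succeeded = random.random() < (0.05 if gender == "woman" else 0.30)
    history.append({"gender": gender, "succeeded": succeeded})

# "Training": estimate success rates per group, as a naive model would.
def success_rate(group):
    rows = [h for h in history if h["gender"] == group]
    return sum(h["succeeded"] for h in rows) / len(rows)

model = {g: success_rate(g) for g in ("woman", "man")}

# Applying the model to today's applicants: screen in anyone whose
# group's historical success rate clears a threshold.
def screen(applicant, threshold=0.15):
    return model[applicant["gender"]] >= threshold

print(model)                        # women score far lower
print(screen({"gender": "woman"}))  # False: filtered out
print(screen({"gender": "man"}))    # True
# The algorithm didn't invent the sexism; it learned it from the past.
```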
Algorithms don't make things fair if you just blithely, blindly apply algorithms. They don't make things fair. They repeat our past practices, our patterns. They automate the status quo. That would be great if we had a perfect world, but we don't. And I'll add that most companies don't have embarrassing lawsuits, but the data scientists in those companies are told to follow the data, to focus on accuracy. Think about what that means. Because we all have bias, it means they could be codifying sexism or any other kind of bigotry.
Thought experiment, because I like them: an entirely segregated society -- racially segregated, all towns, all neighborhoods -- where we send the police only to the minority neighborhoods to look for crime. The arrest data would be very biased. What if, on top of that, we found the data scientists and paid them to predict where the next crime would occur? Minority neighborhood. Or to predict who the next criminal would be? A minority. The data scientists would brag about how great and how accurate their model would be, and they'd be right.
Now, reality isn't that drastic, but we do have severe segregations in many cities and towns, and we have plenty of evidence of biased policing and justice system data. And we actually do predict hotspots, places where crimes will occur. And we do predict, in fact, the individual criminality, the criminality of individuals. The news organization ProPublica recently looked into one of those "recidivism risk" algorithms, as they're called, being used in Florida during sentencing by judges. Bernard, on the left, the black man, was scored a 10 out of 10. Dylan, on the right, 3 out of 10. 10 out of 10, high risk. 3 out of 10, low risk. They were both brought in for drug possession. They both had records, but Dylan had a felony and Bernard didn't. This matters, because the higher your score, the more likely you are to be given a longer sentence.
What's going on? Data laundering. It's a process by which technologists hide ugly truths inside black box algorithms and call them objective; call them meritocratic. When they're secret, important and destructive, I've coined a term for these algorithms: "weapons of math destruction."
(Laughter)
(Applause)
They're everywhere, and it's not a mistake. These are private companies building private algorithms for private ends. Even the ones I talked about, for the teachers and the public police, those were built by private companies and sold to government institutions. They call it their "secret sauce" -- that's why they can't tell us about it. It's also private power. They are profiting from wielding the authority of the inscrutable. Now you might think, since all this stuff is private and there's competition, maybe the free market will solve this problem. It won't. There's a lot of money to be made in unfairness.
Also, we're not economically rational agents. We all are biased. We're all racist and bigoted in ways that we wish we weren't, in ways that we don't even know. We know this, though, in aggregate, because sociologists have consistently demonstrated it with the experiments they build, where they send out a bunch of job applications, equally qualified, but some with white-sounding names and some with black-sounding names, and the results are always disappointing -- always.
So we are the ones that are biased, and we are injecting those biases into the algorithms by choosing what data to collect, like I chose not to think about ramen noodles -- I decided it was irrelevant. But by trusting the data that's actually picking up on past practices and by choosing the definition of success, how can we expect the algorithms to emerge unscathed? We can't. We have to check them. We have to check them for fairness.
The good news is, we can check them for fairness. Algorithms can be interrogated, and they will tell us the truth every time. And we can fix them. We can make them better. I call this an algorithmic audit, and I'll walk you through it.
First, data integrity check. For the recidivism risk algorithm I talked about, a data integrity check would mean we'd have to come to terms with the fact that in the US, whites and blacks smoke pot at the same rate but blacks are far more likely to be arrested -- four or five times more likely, depending on the area. What is that bias looking like in other crime categories, and how do we account for it?
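Here's a back-of-the-envelope version of that check, using the talk's own four-to-five-times ratio; the baseline numbers are hypothetical placeholders.

```python
# If the training label is really "was arrested" rather than "committed
# a crime," equal behavior plus unequal enforcement yields unequal labels.
# The 4.5x multiplier echoes the talk; the other numbers are hypothetical.

usage_rate = 0.10          # same underlying rate of pot smoking, both groups
p_arrest_if_white = 0.02   # hypothetical chance an offense leads to arrest
p_arrest_if_black = p_arrest_if_white * 4.5  # "four or five times more likely"

label_rate_white = usage_rate * p_arrest_if_white
label_rate_black = usage_rate * p_arrest_if_black

print(label_rate_black / label_rate_white)  # 4.5 -- the skew is in the
# labels (enforcement), not in the behavior the model claims to predict.
```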
Second, we should think about the definition of success -- audit that. Remember the hiring algorithm we talked about? Someone who stays for four years and is promoted once? Well, that is a successful employee, but it's also an employee that is supported by their culture. That said, it can also be quite biased. We need to separate those two things. We should look to the blind orchestra audition as an example. That's where the people auditioning are behind a sheet. What I want to think about there is that the people who are listening have decided what's important and what's not important, and they're not getting distracted by anything else. When the blind orchestra auditions started, the number of women in orchestras went up by a factor of five.
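The blind audition translates naturally into a preprocessing step: decide up front which features are the performance and which are the distraction. The feature names here are illustrative, not from any real system.

```python
# Put up the screen: let the model see only the features we decided
# matter, the way the audition committee hears only the playing.

RELEVANT = {"skills", "experience_years", "portfolio_score"}

def put_up_the_screen(applicant: dict) -> dict:
    """Keep only the features we chose as important."""
    return {k: v for k, v in applicant.items() if k in RELEVANT}

applicant = {
    "name": "...", "gender": "...",
    "skills": ["python"], "experience_years": 6, "portfolio_score": 8.5,
}
print(put_up_the_screen(applicant))
# Caveat: proxies (a zip code, a club membership) can leak the screened-out
# attributes back in, so this is one audit step, not the whole audit.
```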
Next, we have to consider accuracy. This is where the value-added model for teachers would fail immediately. No algorithm is perfect, of course, so we have to consider the errors of every algorithm. How often are there errors, and for whom does this model fail? What is the cost of that failure?
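One way to make the "for whom does it fail" question concrete is to compare error rates group by group. The records below are invented.

```python
# Accuracy step of the audit: not just how often the model errs, but
# for whom. Invented records: (group, predicted_high_risk, reoffended).
records = [
    ("A", True,  False), ("A", True,  True),  ("A", False, False),
    ("A", True,  False), ("B", False, False), ("B", True,  True),
    ("B", False, True),  ("B", False, False),
]

def false_positive_rate(group):
    """Among people who did NOT reoffend, how many were flagged high risk?"""
    flags = [pred for g, pred, reoffended in records
             if g == group and not reoffended]
    return sum(flags) / len(flags)

for group in ("A", "B"):
    print(group, false_positive_rate(group))  # A: 0.67, B: 0.0
# If one group's false positives are far more common, the cost of the
# model's failures -- longer sentences -- falls unevenly on that group.
```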
And finally, we have to consider the long-term effects of algorithms, the feedback loops that they engender. That sounds abstract, but imagine if Facebook engineers had considered that before they decided to show us only things that our friends had posted.
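Here's a tiny simulation of the kind of feedback loop meant here, in the spirit of the policing thought experiment from earlier; all numbers are invented.

```python
# Feedback loop: patrol where past arrests were, find crime only where
# you patrol, and the arrest data "confirms" the allocation forever.
# Both districts have identical true crime rates by construction.
import random

random.seed(1)

true_crime_rate = {"district_a": 0.5, "district_b": 0.5}  # identical
arrests = {"district_a": 5, "district_b": 1}              # skewed history

for week in range(20):
    patrolled = max(arrests, key=arrests.get)  # follow the data
    if random.random() < true_crime_rate[patrolled]:
        arrests[patrolled] += 1                # you find crime where you look

print(arrests)  # the initial skew grows; the loop never self-corrects
```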
I have two more messages, one for the data scientists out there. Data scientists: we should not be the arbiters of truth. We should be translators of ethical discussions that happen in larger society.
(Applause)
And the rest of you, the non-data scientists: this is not a math test. This is a political fight. We need to demand accountability for our algorithmic overlords.
(Applause)
The era of blind faith in big data must end.
Thank you very much.
(Applause)