Jennifer Golbeck: Your social media "likes" expose more than you think

If you remember that first decade of the web, it was really a static place. You could go online, you could look at pages, and they were put up either by organizations who had teams to do it or by individuals who were really tech-savvy for the time. And with the rise of social media and social networks in the early 2000s, the web was completely changed to a place where now the vast majority of content we interact with is put up by average users, either in YouTube videos or blog posts or product reviews or social media postings. And it's also become a much more interactive place, where people are interacting with others, they're commenting, they're sharing, they're not just reading.

So Facebook is not the only place you can do this, but it's the biggest, and it serves to illustrate the numbers. Facebook has 1.2 billion users per month. So half the Earth's Internet population is using Facebook. They are a site, along with others, that has allowed people to create an online persona with very little technical skill, and people responded by putting huge amounts of personal data online. So the result is that we have behavioral, preference, and demographic data for hundreds of millions of people, which is unprecedented in history. And as a computer scientist, what this means is that I've been able to build models that can predict all sorts of hidden attributes for all of you that you don't even know you're sharing information about. As scientists, we use that to help the way people interact online, but there are less altruistic applications, and there's a problem in that users don't really understand these techniques and how they work, and even if they did, they don't have a lot of control over them. So what I want to talk to you about today is some of these things that we're able to do, and then give some ideas of how we might go forward to move some control back into the hands of users.

So this is Target, the company. I didn't just put that logo on this poor, pregnant woman's belly. You may have seen this anecdote that was printed in Forbes magazine where Target sent a flyer to this 15-year-old girl with advertisements and coupons for baby bottles and diapers and cribs two weeks before she told her parents that she was pregnant. Yeah, the dad was really upset. He said, "How did Target figure out that this high school girl was pregnant before she told her parents?" It turns out that they have the purchase history for hundreds of thousands of customers and they compute what they call a pregnancy score, which is not just whether or not a woman's pregnant, but what her due date is. And they compute that not by looking at the obvious things, like, she's buying a crib or baby clothes, but things like, she bought more vitamins than she normally did, or she bought a handbag that's big enough to hold diapers. And by themselves, those purchases don't seem like they might reveal a lot, but it's a pattern of behavior that, when you take it in the context of thousands of other people, starts to actually reveal some insights. So that's the kind of thing that we do when we're predicting stuff about you on social media. We're looking for little patterns of behavior that, when you detect them among millions of people, let us find out all kinds of things.

So in my lab and with colleagues, we've developed mechanisms where we can quite accurately predict things like your political preference, your personality score, gender, sexual orientation, religion, age, intelligence, along with things like how much you trust the people you know and how strong those relationships are. We can do all of this really well. And again, it doesn't come from what you might think of as obvious information.

So my favorite example is from this study that was published this year in the Proceedings of the National Academy of Sciences. If you Google this, you'll find it. It's four pages, easy to read. And they looked at just people's Facebook likes, so just the things you like on Facebook, and used that to predict all these attributes, along with some other ones. And in their paper they listed the five likes that were most indicative of high intelligence. And among those was liking a page for curly fries. (Laughter) Curly fries are delicious, but liking them does not necessarily mean that you're smarter than the average person. So how is it that one of the strongest indicators of your intelligence is liking this page when the content is totally irrelevant to the attribute that's being predicted? And it turns out that we have to look at a whole bunch of underlying theories to see why we're able to do this. One of them is a sociological theory called homophily, which basically says people are friends with people like them. So if you're smart, you tend to be friends with smart people, and if you're young, you tend to be friends with young people, and this has been well established for hundreds of years. We also know a lot about how information spreads through networks. It turns out things like viral videos or Facebook likes or other information spread in exactly the same way that diseases spread through social networks. So this is something we've studied for a long time. We have good models of it. And so you can put those things together and start seeing why things like this happen. So if I were to give you a hypothesis, it would be that a smart guy started this page, or maybe one of the first people who liked it would have scored high on that test. And they liked it, and their friends saw it, and by homophily, we know that they probably had smart friends, and so it spread to them, and some of them liked it, and they had smart friends, and so it spread to them, and so it propagated through the network to a host of smart people, so that by the end, the action of liking the curly fries page is indicative of high intelligence, not because of the content, but because the actual action of liking reflects back the common attributes of other people who have done it.
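
The hypothesis above (homophily plus network spread) can be sketched as a small toy simulation. Everything in it is an assumption made for illustration: the 40-person population, the friendship pattern, the rule that every friend of a liker also likes the page, and all names such as `build_friendships` and `spread_like`. It is not the method of the study mentioned in the talk.

```python
from collections import deque

N = 40                      # toy population size (hypothetical)
GROUP = N // 2              # people 0..19 score high on the test, 20..39 do not
scores_high = [i < GROUP for i in range(N)]

def build_friendships():
    """Homophilous toy network: many same-group ties, few cross-group ties."""
    friends = {i: set() for i in range(N)}

    def link(a, b):
        friends[a].add(b)
        friends[b].add(a)

    for i in range(N):
        base = 0 if i < GROUP else GROUP
        for step in (1, 2, 3):                  # ties inside one's own group
            link(i, base + (i - base + step) % GROUP)
    for i in range(2, GROUP, 5):                # sparse ties across the groups
        link(i, i + GROUP)
    return friends

def spread_like(friends, seed, hops):
    """Breadth-first spread: each friend of a liker likes the page, for `hops` rounds."""
    liked = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        person, depth = frontier.popleft()
        if depth == hops:
            continue                            # the like stops spreading here
        for friend in friends[person]:
            if friend not in liked:
                liked.add(friend)
                frontier.append((friend, depth + 1))
    return liked

friends = build_friendships()
likers = spread_like(friends, seed=0, hops=2)   # person 0 is a high scorer
share_among_likers = sum(scores_high[p] for p in likers) / len(likers)
base_rate = sum(scores_high) / N
print(f"high scorers among likers: {share_among_likers:.2f}, "
      f"in population: {base_rate:.2f}")
```

In this toy setup, 13 of the 15 people who end up liking the page are high scorers, against half of the population overall, even though the page itself carries no information about the trait. The like is informative only because of who passed it along.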

So this is pretty complicated stuff, right? It's a hard thing to sit down and explain to an average user, and even if you do, what can the average user do about it? How do you know that you've liked something that indicates a trait for you that's totally irrelevant to the content of what you've liked? There's a lot of power that users don't have to control how this data is used. And I see that as a real problem going forward.

So I think there's a couple paths that we want to look at if we want to give users some control over how this data is used, because it's not always going to be used for their benefit. An example I often give is that, if I ever get bored being a professor, I'm going to go start a company that predicts all of these attributes and things like how well you work in teams and if you're a drug user, if you're an alcoholic. We know how to predict all that. And I'm going to sell reports to H.R. companies and big businesses that want to hire you. We totally can do that now. I could start that business tomorrow, and you would have absolutely no control over me using your data like that. That seems to me to be a problem.

So one of the paths we can go down is the policy and law path. And in some respects, I think that that would be most effective, but the problem is we'd actually have to do it. Observing our political process in action makes me think it's highly unlikely that we're going to get a bunch of representatives to sit down, learn about this, and then enact sweeping changes to intellectual property law in the U.S. so users control their data.

We could go the policy route, where social media companies say, you know what? You own your data. You have total control over how it's used. The problem is that the revenue models for most social media companies rely on sharing or exploiting users' data in some way. It's sometimes said of Facebook that the users aren't the customer, they're the product. And so how do you get a company to cede control of their main asset back to the users? It's possible, but I don't think it's something that we're going to see change quickly.

So I think the other path that we can go down that's going to be more effective is one of more science. It's doing science that allowed us to develop all these mechanisms for computing this personal data in the first place. And it's actually very similar research that we'd have to do if we want to develop mechanisms that can say to a user, "Here's the risk of that action you just took." By liking that Facebook page, or by sharing this piece of personal information, you've now improved my ability to predict whether or not you're using drugs or whether or not you get along well in the workplace. And that, I think, can affect whether or not people want to share something, keep it private, or just keep it offline altogether. We can also look at things like allowing people to encrypt data that they upload, so it's kind of invisible and worthless to sites like Facebook or third party services that access it, but the select users that the person who posted it wants to see it still have access to it. This is all super exciting research from an intellectual perspective, and so scientists are going to be willing to do it. So that gives us an advantage over the law side.
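
One way to picture a "here's the risk of that action" message is a single Bayes' rule update: how much does one like shift a predicted probability? The sketch below is only an illustration under stated assumptions. The function names, the example page, and all three probabilities are invented, and real predictive models combine many signals rather than one.

```python
def posterior_after_action(prior, p_action_given_trait, p_action_given_no_trait):
    """Bayes' rule: probability of a trait after observing a single action."""
    evidence = (prior * p_action_given_trait
                + (1 - prior) * p_action_given_no_trait)
    return prior * p_action_given_trait / evidence

def risk_report(action, trait, prior, p_given_trait, p_given_no_trait):
    """Summarize how far one action moves the predicted probability of a trait."""
    after = posterior_after_action(prior, p_given_trait, p_given_no_trait)
    return {
        "action": action,
        "trait": trait,
        "before": prior,
        "after": after,
        "shift": after - prior,
    }

# Made-up numbers: 10% of users have the trait, 30% of users with the trait
# like the page, and 10% of users without the trait like it.
report = risk_report(
    action="liked page X",
    trait="trait Y",
    prior=0.10,
    p_given_trait=0.30,
    p_given_no_trait=0.10,
)
print(f"{report['action']}: predicted chance of {report['trait']} "
      f"moves from {report['before']:.0%} to {report['after']:.0%}")
```

With these invented inputs the prediction moves from 10% to 25%. A tool along these lines could show that shift to a user before the like is made public.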

One of the problems that people bring up when I talk about this is, they say, you know, if people start keeping all this data private, all those methods that you've been developing to predict their traits are going to fail. And I say, absolutely, and for me, that's success, because as a scientist, my goal is not to infer information about users, it's to improve the way people interact online. And sometimes that involves inferring things about them, but if users don't want me to use that data, I think they should have the right to do that. I want users to be informed and consenting users of the tools that we develop.

And so I think encouraging this kind of science and supporting researchers who want to cede some of that control back to users and away from the social media companies means that going forward, as these tools evolve and advance, we're going to have an educated and empowered user base, and I think all of us can agree that that's a pretty ideal way to go forward.

Thank you.

(Applause)
