Now, I haven't met most of you or really any of you, but I feel a really good vibe in the room.
(Laughter)
And so I think I'd like to treat you all to a meal. What do you think? Yes? Great, so many new friends. So we're going to go to this cafe, they serve sandwiches. And the sandwiches are really delicious. But I have to tell you that sometimes they make people really, really sick.
(Laughter)
And we don't know why. Because the cafe won't tell us how they make the sandwich, they won't tell us about the ingredients. And then the authorities have no way to fix the problem. But the offer still stands. So who wants to get a sandwich?
(Laughter)
Some brave souls, we can talk after. But for the rest of you, I understand. You don't have enough information to make good choices about your safety or even fix the issue.
Now, before I further the anxiety here, I'm not actually trying to make you sick, but this is an analogy to how we're currently making algorithmic systems, also known as artificial intelligence or AI.
Now, for those who haven't thought about the relationship between AI and sandwiches, don't worry about it, I'm here for you, I'm going to explain. You see, AI systems, they provide benefit to society. They feed us, but they're also inconsistently making us sick. And we don't have access to the ingredients that go into the AI. And so we can't actually address the issues. We also can't stop eating AI like we can just stop eating a shady sandwich because it's everywhere, and we often don't even know that we're encountering a system that's algorithmically based.
So today, I'm going to tell you about some of the AI trends that I see. I'm going to draw on my experience building these systems over the last two decades to tell you about the tools that I and others have built to look into these AI ingredients. And finally, I'm going to leave you with three principles that I think will give us a healthier relationship to the companies that build artificial intelligence.
I'm going to start with the question, how did we get here? AI is not new. We have been living alongside AI for two decades. Every time you apply for something online, open a bank account or go through passport control, you're encountering an algorithmic system. We've also been living with the negative repercussions of AI for 20 years, and this is how it makes us sick. These systems get deployed on broad populations, and then certain subsets end up being disparately and negatively impacted, usually on the basis of race, gender or other characteristics. We need to be able to understand the ingredients of these systems so that we can address the issues.
So what are the ingredients of an AI system? Well, data fuels the AI. The AI is going to look like the data that you gave it. So for example, if I want to make a risk-assessment system for diabetes, my training data set might be adults in a certain region. And so I'll build that system, and it'll work really well for those adults in that region. But it does not work for adults in other regions, and maybe not at all for children. So you can imagine, if we deploy this for all those populations, there are going to be a lot of people who are harmed. We need to be able to understand the quality of the data before we use it.
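To make that concrete, here is a minimal sketch, in Python with entirely made-up records and field names, of the kind of coverage check you would want to run on a training dataset before building or deploying something like that diabetes risk system:

```python
from collections import Counter

# Toy training records for a hypothetical diabetes risk model.
# The fields and values are illustrative, not from any real dataset.
training_records = [
    {"age": 54, "region": "north", "has_diabetes": True},
    {"age": 61, "region": "north", "has_diabetes": False},
    {"age": 47, "region": "north", "has_diabetes": True},
    {"age": 58, "region": "east", "has_diabetes": False},
]

def coverage_report(records):
    """Summarize who is, and who is not, represented in the data."""
    regions = Counter(r["region"] for r in records)
    ages = [r["age"] for r in records]
    return {
        "n_records": len(records),
        "regions": dict(regions),
        "age_range": (min(ages), max(ages)),
    }

print(coverage_report(training_records))
# {'n_records': 4, 'regions': {'north': 3, 'east': 1}, 'age_range': (47, 61)}
```

The report makes the gap visible before any model is trained: almost everyone in this data comes from one region and no one is under 47, so nothing in it supports deploying the resulting system on other regions or on children.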
But I'm sorry to tell you that we currently live in what I call the Wild West of data. It's really hard to assess quality of data before you use it. There are no global standards for data quality assessment, and there are very few data regulations around how you can use data and what types of data you can use.
This is kind of like if, in the food safety realm, we couldn't tell where the ingredients were sourced and had no idea whether they were safe to consume. We also tend to stitch together data from many places: data we find on the internet, scrape, generate or source. And every time we stitch it together, we lose information about its quality. And the folks who are building the models are not the ones who found the data, so even more information is lost.
Now, I've been asking myself a lot of questions about how we can understand the quality of data before we use it. This comes from two decades of building these kinds of systems. The way I was trained to build systems is similar to how people do it today: you build for the middle of the distribution, your normal user. So for me, a lot of my training data sets would include information about people from the Western world who speak English and have certain normative characteristics. And it took me an embarrassingly long time to realize that I was not my own user. I identify as non-binary and mixed race, I wear a hearing aid, and I just wasn't represented in the data sets that I was using. So I was building systems that literally didn't work for me. For example, I once built a system that repeatedly told me that I was a white Eastern-European lady. This did a real number on my identity.
(Laughter)
But perhaps even more worrying, this was a system to be deployed in health care, where your background can determine things like risk scores for diseases. And so I started to wonder, can I build tools and work with others to do this so that I can look inside of a dataset before I use it?
In 2018, I was part of a fellowship at Harvard and MIT, and I, with some colleagues, decided to try to address this problem. And so we launched the Data Nutrition Project, which is a research group and also a nonprofit that builds nutrition labels for datasets. So similar to food nutrition labels, the idea here is that you can look inside of a data set before you use it. You can understand the ingredients, see whether it's healthy for the things that you want to do.
Now this is a cartoonified version of the label. The top part tells you how complete the label itself is. Underneath that you have information about the data: the description, the keywords, the tags and, importantly, on the right-hand side, how you should and should not use the data. If you could scroll on this cartoon, you would see information about risks and mitigation strategies across a number of vectors.
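If it helps to imagine the label as something machine-readable, here is a rough sketch of that same structure as metadata; the field names are my own illustration, not the Data Nutrition Project's actual schema:

```python
import json

# An illustrative dataset label, loosely following the sections described
# above (completeness, description, keywords, tags, intended use, risks).
# This is a hypothetical structure, not the real Dataset Nutrition Label format.
dataset_label = {
    "label_completeness": "8 of 10 sections filled in",
    "description": "Adult health survey responses from one region, 2015-2018",
    "keywords": ["health", "survey", "adults"],
    "tags": ["tabular", "self-reported"],
    "intended_uses": ["regional public-health research"],
    "discouraged_uses": ["individual risk scoring", "use outside the sampled region"],
    "risks_and_mitigations": [
        {
            "risk": "children and other regions are not represented",
            "mitigation": "do not deploy models trained on this data beyond the sampled population",
        },
    ],
}

print(json.dumps(dataset_label, indent=2))
```

The point is not the exact fields; it's that anyone picking up the dataset can read the ingredients and the warnings before deciding whether to cook with it.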
And we launched this with two audiences in mind. The first audience is folks who are building AI. They're choosing datasets, and we want to help them make a better choice. The second audience is folks who are building datasets. And it turns out that when you tell someone they have to put a label on something, they think about the ingredients beforehand. The analogy here might be: if I want to make a sandwich and say that it's gluten-free, I have to think about all the components as I make it, the bread, the ingredients, the sauces. I can't just slap a label on a sandwich, put it in front of you and tell you it's gluten-free.
We're really proud of the work that we've done. We launched this as a design and then a prototype and ultimately a tool for others to make their own labels. And we've worked with experts at places like Microsoft Research, the United Nations and professors globally to integrate the label and the methodology into their work flows and into their curricula.
But we know it only goes so far. And that's because it's actually really hard to get a label on every single dataset. And this comes down to the question of why would you put a label on a dataset to begin with?
Well, the first reason is not rocket science. It's that you have to. And this is, quite frankly, why food nutrition labels exist. It's because if they didn't put them on the boxes, it would be illegal. However, we don't really have AI regulation. We don't have much regulation around the use of data. Now there is some on the horizon. For example, the EU AI Act just passed this week. And although there are no requirements around making the training data available, they do have provisions for creating transparency labeling like the dataset nutrition label, data sheets, data statements. There are many in the space. We think this is a really good first step.
The second reason you might put a label on a dataset is because it's a best practice or a cultural norm. The example here might be how we're starting to see more and more food packaging and restaurant menus include information about whether there's gluten. This is not required by law, although if you do say it, it had better be true. And the reason people are adding this to their menus and their packaging is because there's increased awareness of how serious that kind of allergy or condition can be. So we're also seeing some movement in this area. Folks who are building datasets are starting to put nutrition labels and data sheets on their datasets, and people who are using data are starting to request that information. This is really heartening.
And you might say, "Kasia, why are you up here? Everything seems to be going well, seems to be getting better." In some ways it is. But I'm also here to tell you that our relationship to data is getting worse. Now the last few years have seen a supercharged interest in gathering datasets. Companies are scraping the web. They're transcribing millions of hours of YouTube videos into text. By some estimates, they'll run out of information on the internet by 2026. They're even considering buying publishing houses so they can get access to printed text and books.
So why are they gathering this information? Well, they need more and more information to train a new technique called generative AI. I want to give you a sense of the size of these datasets. If you look at GPT-3, which is a model that launched in 2020, its training dataset included 300 billion words, or parts of words. Now for context, the English language contains fewer than a million words. Just three years later, DBRX was launched, which was trained on eight trillion words. So 300 billion to eight trillion, more than a twenty-five-fold increase, in three years. And the datasets keep getting bigger. Yet with each successive model launch, the datasets are actually less and less transparent. And even when we do have access to the information, it's so big that it's hard to look inside without any kind of transparency tooling.

The generative AI itself is also causing some worries. You've probably encountered this technique through ChatGPT. I don't need to know what you do on the internet, that's between you and the internet, but you probably know, just like I do, how easy it is to create information using ChatGPT and other generative AI technologies and to put that out onto the web. So we're looking at a situation in which we're going to encounter lots of information that's algorithmically generated, but we won't know it and we won't know whether it's true. And this increases the scale of the potential risks and harms from AI.
Not only that, I'm sorry to say, but the models themselves are being controlled by a smaller and smaller number of private actors, mostly in US tech firms. These are the models that were launched last year, in 2023. And you can see that most of them are pink, meaning they came out of industry. If you look at this over time, more and more are coming out of industry and fewer and fewer are coming out of all the other sectors combined, including academia and government, where technology is typically launched in a way that's easier to scrutinize.
So if we go back to our cafe analogy, this is like you have a small number of private actors who own all the ingredients, they make all the sandwiches globally, and there's not a lot of regulation. And so at this point you're probably scared and maybe feeling a little uncomfortable. Which is ironic because a few minutes ago, I was going to get you all sandwiches and you said yes. This is why you should not accept food from strangers.
But I wouldn't be up here if I weren't also optimistic. And that's because I think we have momentum behind the regulation and the culture changes, especially if we align ourselves with three basic principles about how corporations should engage with data. The first principle is that companies that gather data should tell us what they're gathering. This would allow us to ask questions like: Is it copyrighted material? Is that information private? Could you please stop? It also opens up the data to scientific inquiry.
The second principle is that companies that are gathering our data should tell us what they're going to do with it before they do anything with it. And by requiring that companies tell us their plan, we require that they have a plan, which would be a great first step. It would also probably lead to the minimization of data capture, because companies wouldn't be able to capture data if they didn't already know what they were going to do with it.
And finally, principle three: companies that build AI should tell us about the data that they use to train the AI. And this is where dataset nutrition labels and other transparency labeling come into play. In the case where the data itself won't be made available, which is probably most of the time, the labeling is critical for us to be able to investigate the ingredients and start to find solutions.
So I want to leave you with the good news, and that is that the Data Nutrition Project and other efforts like it are just a small part of a global movement toward AI accountability, and a first step. Regulation is on the horizon, the cultural norms are shifting, especially if we align with these three basic principles: that companies should tell us what they're gathering, tell us what they're going to do with it before they do anything with it, and that companies building AI should explain the data they're using to build the system.
We need to hold these organizations accountable for the AI that they're building by asking them, just like we do with the food industry, what's inside and how did you make it? Only then can we mitigate the issues before they occur, as opposed to after they occur. And in doing so, create an integrated algorithmic internet that is healthier for everyone.
Thank you.
(Applause)