
Human Intelligence: Wikipedia CTO Selena Deckelmann on AI and LLM training


Selena Deckelmann. Courtesy of Selena Deckelmann

Human Intelligence is an occasional series to help PBJ readers better understand how they might use AI to improve their businesses. We'll highlight takeaways and best practices shared by people who are adding AI to their products or who study the technology.

When Portland technologist Selena Deckelmann took the job of CTO and chief product officer at the Wikimedia Foundation, ChatGPT and generative AI had yet to take over the zeitgeist.

A year into the job, it’s a different story. Like many organizations, the Wikimedia Foundation is experimenting with tools like ChatGPT. The group released a ChatGPT plugin over the summer that connects the AI tool to the foundation's most well-known product, Wikipedia.

However, Deckelmann said the group has been working with AI and machine learning for some time as it looks for ways to help the thousands of Wikipedia editors do their jobs more effectively. She noted that roughly 300,000 edits are made on Wikipedia each month. Plus, a single article can involve up to 1,000 different people discussing its creation or editing, as was the case for the article about the current Israel-Hamas War.

An AI tool could be effective in helping summarize long discussion threads as editors make decisions and come to consensus, she said. Deckelmann and her team are doing what many business leaders are doing right now: evaluating where these new AI tools make sense. She also carries the added weight that, so far, all of the large language models that power these generative AI tools are trained on Wikipedia’s data.
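
As an illustration of the kind of tool she describes, here is a hypothetical sketch of discussion-thread summarization. It assumes the OpenAI Python client and placeholder comments; the article does not say what model or stack Wikimedia actually uses.

```python
# Hypothetical sketch: summarizing a long editor discussion thread with an
# LLM so editors can review consensus faster. Assumes the OpenAI Python
# client; Wikimedia's actual tooling is not described in the article.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_thread(comments: list[str]) -> str:
    """Condense a talk-page discussion into a short, neutral summary."""
    thread = "\n\n".join(comments)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": "Summarize this Wikipedia talk-page discussion. "
                        "List the main positions and any emerging consensus."},
            {"role": "user", "content": thread},
        ],
    )
    return response.choices[0].message.content

# Example usage with placeholder comments:
# print(summarize_thread(["Editor A: The source looks unreliable...",
#                         "Editor B: It meets the sourcing guideline..."]))
```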

We recently spoke with Deckelmann about AI and where Wikipedia fits in.

As you develop AI tools, is there a time when Wikipedia could be solely produced by computers? For us, what Wikipedia is, is the result of a process between people where they debate and curate sources and decide together what knowledge is. So I don’t think an AI could replace that, because (its output) isn’t the result of a process of discussion, debate or consensus. You could have an AI produce these articles (and there are people doing that), but then what you find in them is hallucinations and sources that don’t exist. While I think (AI) will get better down the line, at the end of the day what we are offering is a group of people taking ownership of content and saying this is a thing we produce together. And that is hard to get from a generative AI system. Even if we find places where AI systems create good articles, you will still need a human to go through and verify that the sources exist, that the article makes sense overall, and that it isn’t full of hallucinations. I think the value of having truly human-created content, in this kind of environment where it’s very cheap to create other kinds of text, (will be more important).

These large language models that power generative AI are trained on Wikipedia content. How does the organization work with these other groups? Wikipedia itself has an open license. It has always been openly licensed content, free to use for commercial and noncommercial purposes. This isn’t the first time that a business has created something extremely valuable using Wikipedia data. We have a (paid) product we created to make API-based access to the data easier and faster, because there are different ways of accessing the data.
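
As context for readers, the openly licensed content is also reachable through free public APIs; the paid product she references (Wikimedia Enterprise) layers faster, higher-volume access on top. Below is a minimal sketch using the public REST summary endpoint and Python's `requests` library, not the paid product.

```python
# Minimal sketch: fetching openly licensed Wikipedia content through the
# free public REST API. The paid Wikimedia Enterprise product offers
# bulk/low-latency access; this example does not use it.
import requests

def fetch_summary(title: str, lang: str = "en") -> dict:
    """Return the lead summary of a Wikipedia article as parsed JSON."""
    url = f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{title}"
    resp = requests.get(
        url,
        headers={"User-Agent": "example-reader/0.1"},  # identify your client
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    page = fetch_summary("Wikipedia")
    print(page["title"])
    print(page["extract"])  # plain-text summary of the article's lead section
```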

How do you think about the role Wikipedia plays in training? Part of the reason behind the (open) license we chose is that it's hard to predict what new ways people will find to use the knowledge that gets created, and new ways to remix it. There are definitely differences of opinion in our community about what the right amount of use of this data is, and how valuable a company built on it should be able to become. But I think for us the heart of it is: if we can find ways of using the knowledge and sharing it with more people, that is our mission. The challenge that I see is connecting the spread of the knowledge with encouraging people to contribute things back to the common spaces that are publicly accessible. I think a really sad turn of events would be if the companies that are creating these kinds of tools don’t find a way to encourage people to contribute back to those systems.

You mean so that AI training doesn’t just become LLMs training on content created by AI? That’s called model collapse. If you feed the output of a generative AI system back into itself enough times, it just stops making sense at all. That’s a real danger. If the internet (or the web more specifically), which most of the LLMs are trained on, gets filled with content from LLMs, they are not going to be able to train on it effectively anymore.
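
To make the mechanism concrete, here is a toy numerical sketch (illustrative only, using NumPy and a Gaussian in place of a language model): each "generation" is trained solely on samples from the previous one, and the fitted distribution drifts away from the original data.

```python
# Toy illustration of model collapse (a sketch, not a real LLM pipeline):
# each generation fits a Gaussian "model" only to samples drawn from the
# previous generation's model. Estimation error compounds, so the mean
# wanders and the spread typically shrinks: diversity is lost.
import numpy as np

rng = np.random.default_rng(0)
human_data = rng.normal(loc=0.0, scale=1.0, size=100)  # original "human" data

mu, sigma = human_data.mean(), human_data.std()
for generation in range(1, 21):
    synthetic = rng.normal(mu, sigma, size=100)  # train only on model output
    mu, sigma = synthetic.mean(), synthetic.std()
    print(f"gen {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")
```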

When you took this job, the explosion of generative AI hadn’t happened yet. Has that changed how you work? There are some core technology issues that I don’t think are really that different. We have principles we think about whenever we are going to deploy a system or prioritize different work. When it comes to AI, the No. 1 thing is sustainability. Are we creating a system that is sustainable, and does it have the right volunteer input? Is it doing something that truly advances our mission? The second thing is equity. Are we deploying a system that equitably supports the different communities that we are involved with? Is it advancing more equity in the world, or is it increasing bias and causing problems for historically underrepresented groups? The last thing is transparency. Are we able to take the model inputs and give the people who are affected by the use of the models meaningful ways to give feedback and influence (the models)? Those same principles apply to other technology we deploy, so it’s not entirely new, but I think it’s very urgent. It’s urgent that we think about the ways that AI systems are affecting volunteers and the users of Wikipedia.

Where do you see AI being the most successful? I think the AI systems that are truly effective and really improve our lives are going to be things that are very targeted, not general. It’s about finding a specific thing to improve. For us, it’s improving how efficient an editor is when they are reviewing a discussion, or improving the quality of translation so (editors) don’t have to dig around a whole bunch to figure out what words work between one language and another. These kinds of things can really make their productivity go way up and make them a lot happier with the output of the work. That’s where I can see how valuable (AI) can be to individuals.

