Controlling AI Training Crawlers: Beyond Copyright
AI-powered chatbots draw on the collective work of billions of humans. To respond to users’ prompts, ChatGPT, Google Gemini, Microsoft Copilot and other large language models rely on their analysis of trillions of words of human writing posted online. Likewise, image generators couldn’t turn out graphics without analyzing the billions of photos and illustrations we’ve already put on the web. Unless they’ve instead been given a specific, limited store of content to learn from, these systems “crawl” the web to gather enough material to perform the tasks people now routinely ask of them.
When the rights of the humans who created all that content have been discussed, it’s generally been in the context of copyright law. Many AI firms are already facing lawsuits from writers and artists upset that tech companies are profiting from the use or reproduction of their work. But copyright isn’t the only relevant interest, and courts aren’t always the best way to work out complex issues with many stakeholders.
Many people won’t have a copyright claim, for example, but they may still care how their work is used or how information about them is shared. Meanwhile, researchers want the ability to analyze content for scientific and public-interest purposes without getting caught up in disputes about copyright or how AIs are trained. Plus, relying on the legal system to adjudicate an AI’s ability to access content may favor the big players, since they’ll have more resources to fight for their interests in court or to establish exclusive licensing deals.
Fortunately, the internet standards-setting process provides an alternative model for working out these issues. Tech standards bodies have a long history of finding solutions to problems involving a wide variety of stakeholders, including tech companies large and small, civil society organizations, national governments, researchers, individual users and more. We also have a precedent for giving website owners a technical means to communicate their preferences about automated attempts to index their content. As search engines began crawling the web three decades ago, an informal, collaborative process yielded the “robots.txt” file, which lets website managers at least indicate parameters for search-engine crawlers.
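To make the mechanism concrete, here is a brief sketch of a robots.txt file that opts out of AI-training crawlers while still permitting ordinary search indexing. GPTBot (OpenAI) and Google-Extended (Google) are crawler tokens those companies have published; the /drafts/ path is invented for illustration, and compliance with any of these rules is voluntary on the crawler’s part.

```
# Opt out of AI-training crawlers site-wide.
# GPTBot (OpenAI) and Google-Extended (Google) are published
# user-agent tokens; /drafts/ is a hypothetical path.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# All other crawlers may index everything except drafts.
User-agent: *
Disallow: /drafts/
```

That last caveat is the crux: robots.txt expresses preferences rather than enforcing them, which is precisely the sort of gap a broader standards effort would need to grapple with.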
With this history in mind, the Internet Architecture Board will convene a workshop in Washington, DC this week, focused on controls on crawling for AI training and the potential for standards work at the Internet Engineering Task Force. Short position papers from potential participants explore the issues involved and potential standards that would allow content creators to track or opt out of having their work become part of the training set for an AI model. CDT’s Eric Null will present, encouraging consideration of privacy and other non-copyright interests and the inclusion of a broad range of stakeholders in any standards-setting process.
Technical standards are only recommendations and will not by themselves provide an enforcement mechanism. Nevertheless, an updated technical standard could help build consensus around respecting the wishes of the people who’ve created and maintained all the web content now being used to train chatbots and other AI tools.