AI Policy & Governance, CDT Research
Grounding AI Policy: Towards Researcher Access to AI Usage Data
Introduction
A chair is for sitting. A clock is for telling time. To look at these objects is to understand their primary use. Until recently, AI was, in most cases, a similar technology, where design and use were closely linked. A facial recognition system recognized faces; a spellchecker checked spelling. Today, though, with the advent of powerful “transformer models,” a single AI application can (at least in appearance) be used to countless ends: to write poetry, evaluate a resume, identify bird species, and diagnose diseases. As possible use cases become broader, so do the potential risks, which now range from the malicious, such as generating propaganda or sexual images of children, to the inadvertent, such as providing misleading election or health information.
With these advances, companies and governments are rapidly integrating AI into new systems and domains (Knight, 2023). In response, policymakers are scrambling to regulate AI in order to mitigate its risks and maximize its potential benefits. This has manifested in a flurry of political activity, which in the US alone includes dozens of proposed federal bills, a small number of state laws and hundreds more state bills, the longest executive order ever issued, and a tide of regulatory guidance.
However, when designing new regulations, policymakers face an empirical dilemma: they must regulate AI without any access to real-world data on how people and businesses are using these systems. Unlike social media and the internet, where user behavior is often public and leaves observable data traces, general-purpose AI systems are largely accessed through private, one-on-one interactions, such as conversations with chatbots. AI companies collect user interaction data, but are reluctant to share it even with vetted researchers, citing privacy, security, reputational, competitive, and trade secrecy concerns (Bommasani et al., 2024; Sanderson & Tucker, 2024). Instead, companies allow researchers and other external parties to probe their systems for vulnerabilities and harmful errors through practices such as red-teaming (Friedler et al., 2023). While these methods can help prevent AI systems from being put to the worst possible uses, they do not offer empirical insights about the harms users experience in the real world.
The lack of available empirical information about how people use general-purpose AI systems makes it extremely challenging to develop evidence-informed policy. Three potential methods can help address this use case information gap, each with its own benefits and challenges:
- Data donations. Users can voluntarily share data about their own interactions with AI systems (e.g., chat logs) directly with researchers (Sanderson & Tucker, 2024). AI companies can build technical tools to support this, including APIs, data portability tools, or a “Share your data with researchers” option. Researchers can also allow users to donate data directly, typically through browser extensions, without needing permission or support from companies (Shapiro et al., 2021). Data donations raise few privacy concerns, but may introduce sampling bias, since those with the interest and technical skills to donate their data may not represent AI users writ large (van Driel et al., 2022).
- Transparency reports. AI companies can analyze data about how people use their systems and share their findings with the public (Bommasani et al., 2024; Vogus & Llansó, 2021). Companies can solicit feedback from experts in high-risk domains, such as health care and elections, about what information would be of use to them. This kind of transparency report differs from the current White House voluntary commitments and similar efforts around the world, which focus on disclosing companies’ efforts to keep users safe. Transparency reports raise little privacy risk, but can be opaque in their methodologies and details and potentially co-opted to serve company interests (Parsons, 2017).
- Direct access to log data. AI companies can grant researchers access to chat log data and other information they hold about users’ interactions with their products. Companies could provide this access directly, or indirectly by running queries on behalf of researchers. They could do so voluntarily or, potentially, under a legal mandate (Lemoine & Vermeulen, 2023). Direct access poses significant privacy risks. While technical interventions might partially mitigate these risks, they may not be able to address them sufficiently to justify the practice. Companies may further resist granting direct data access, as it could jeopardize their reputation or expose corporate secrets.
This paper proceeds in three parts. First, it describes the use case information gap, why it should be closed, and what challenges there are to doing so. Then, it gives more detail on the three approaches to providing researchers access to use case information previously mentioned. Finally, it offers recommendations for how AI companies and lawmakers can implement these approaches in ways that benefit researchers and ultimately the public, while safeguarding users’ privacy.
Definitions and Scope
This paper specifically focuses on researcher access to use case information for popular, consumer-facing general-purpose AI applications. In practice, this means sharing chat logs from chatbots built by foundation model developers, such as OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude. This paper focuses on these systems not because they are the most important (indeed, they arguably receive too much attention already) but for practical reasons.
Working backwards, this paper focuses on AI applications, rather than foundation models (e.g., GPT-4, Claude 3 Opus, Llama) or model hosting services (e.g., the GPT-4 API, Stable Diffusion, Microsoft Azure) (Jones, 2023). Foundation models may not always have a centralized entity to monitor their use, as is the case with “open source” models, such as Llama and Mistral (Solaiman, 2023). Hosting services could in theory monitor AI usage, but moving governance and surveillance lower down the technical stack raises greater privacy concerns (Donovan, 2019). This merits its own analysis, outside the scope of this paper. This paper also focuses on consumer-facing AI products rather than business-to-business services, as the latter involve trade secrecy concerns that are beyond the scope of this study. Furthermore, it focuses on popular AI applications because they are more likely to have significant societal effects that merit research scrutiny and more likely to have the resources needed to build the infrastructure necessary to make usage data available to researchers.
Finally, this paper borrows the concept of “general-purpose AI” (GPAI) from the EU AI Act, which defines it as, “an AI model, including when trained with a large amount of data using self-supervision at scale, that displays significant generality and is capable to competently perform a wide range of distinct tasks regardless of the way the model is placed on the market and that can be integrated into a variety of downstream systems or applications.” (AI Act, Article 3, Section 44b). While concepts like “generality” and “capability” are up for debate, this paper focuses on chatbot applications built on top of state-of-the-art models designed to cover the broadest range of domains, rather than narrow uses such as customer service chatbots.
With a definition of “AI systems” in hand, we can clarify what we mean by use case information. This paper primarily treats use case information as chat logs, i.e., the text and other media content of a user’s messages and the AI system’s responses. Chat logs are limited, since they reveal nothing about the context of usage. For example, a user asking a chatbot to write an email requesting an overdue payment could be using that text to run a phishing scam or to help navigate an awkward conversation with an associate about money. As will be discussed later, chat logs also risk exposing very personal or personally identifiable information, which can be challenging to conceal from researchers.
Use case information can also include metadata, which is information about the data. Metadata may encompass details about the conversation itself, such as timestamps, session identifiers, AI system versions, error logs, usage policy violations, and refusals, as well as other actions the user has taken, such as regenerating a response or flagging content. It can also include information about the user, such as user identifiers, device information, and location data, but due to the high risk of user re-identification, information about the user is outside the scope of this paper.
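To make the distinction between conversation-level and user-level metadata concrete, the following is a minimal sketch of how a single chat log record might be represented and reduced to the fields in scope here. All class, field, and function names are hypothetical and chosen for illustration only; they do not reflect any company's actual log format.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

# Hypothetical names for user-level fields, which this paper treats as out of scope.
USER_LEVEL_FIELDS = {"user_id", "device_info", "location"}


@dataclass
class Message:
    role: str      # "user" or "assistant"
    content: str   # text of the message (other media omitted in this sketch)


@dataclass
class ChatLogRecord:
    # Conversation-level metadata
    session_id: str
    timestamp: str
    system_version: str
    messages: list[Message] = field(default_factory=list)
    refused: bool = False            # the AI system refused the request
    policy_violation: bool = False   # a usage policy violation was logged
    regenerated: bool = False        # the user regenerated a response
    flagged_by_user: bool = False    # the user flagged content
    # User-level metadata (high re-identification risk)
    user_id: Optional[str] = None
    device_info: Optional[str] = None
    location: Optional[str] = None


def to_research_record(record: ChatLogRecord) -> dict:
    """Return the record as a dict with user-level fields removed.

    This does not scrub personal information from the message text itself.
    """
    data = asdict(record)
    return {k: v for k, v in data.items() if k not in USER_LEVEL_FIELDS}
```

Dropping user-level fields in this way would not by itself make a record safe to share, since the message content can still contain personal or personally identifiable information.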