AI Policy & Governance, CDT AI Governance Lab
Trustworthy AI Needs Trustworthy Measurements
Last month, the National Institute of Standards and Technology (NIST) formally launched the U.S. AI Safety Institute (USAISI), tasked with developing guidelines for areas including AI risk assessment, capability evaluation, safety best practices, and watermarking as laid out in President Biden’s ambitious Executive Order on AI. To support the efforts of the USAISI, NIST also established the U.S. AI Safety Institute Consortium made up of over 200 private companies, civil society groups, and academics. At its launch, the Consortium was described as “focus[ing] on establishing the foundations for a new measurement science in AI safety.”
The Consortium’s focus on measurement is for good reason: effective frameworks for managing AI risks depend on the ability to accurately evaluate the impacts of AI. Without sound measurements, AI developers cannot accurately capture the nature and scope of risks, let alone gauge the efficacy of interventions to mitigate them. Beyond more general risk assessment efforts, methodologically sound measurements can also help researchers and developers track various dimensions of AI progress over time, in terms of both utility and risk.
Unfortunately, measuring AI systems is significantly harder than it may seem. Evaluation is especially challenging when the qualities of AI systems that stakeholders care about are hard to define, much less quantify. Take, for example, fairness. AI systems can distribute fewer resources to socially disadvantaged groups, perpetuate harmful stereotypes, or downrank content produced by marginalized creators. Each of these problems reflects a different kind of unfairness. As a result, each requires a different measurement approach.
The rise of so-called “general purpose AI” or “foundation models” makes measurement even more complicated. General-purpose AI systems are trained on vast datasets to learn patterns and relationships within data, such as the structure of language or general knowledge about the world. Much like flour, yeast, and water can be combined in different ways to make many types of bread, general-purpose AI systems can be adapted to a wide variety of tasks. For example, models like the one powering ChatGPT can be adapted to detect the topic of a passage of text, to answer factual questions about a specific domain (e.g., biology), or to generate computer code. Since developers of general-purpose AI systems cannot always foresee how their models might be adapted and used downstream, they often struggle with identifying and employing measurements that accurately capture the models’ capabilities, risks, and limitations.
While AI technologies may be new, the challenge of measuring nebulous concepts has a long history. Social science fields such as psychology, sociology, and education have long wrangled with measuring qualities that do not lend themselves neatly to quantification. For example, psychologists struggle with defining what constitutes “spatial reasoning,” and education researchers seek to develop notions of “teacher quality.” Arriving at measurements of such qualities is difficult because it is not possible to observe them directly. Unobservable, squishy concepts are called “constructs” in the social sciences. To measure constructs, researchers must define them in terms of a measurable quality, a process called “operationalization.” For example, psychologists could operationalize the construct of spatial reasoning based on how frequently people say they get lost while driving. Education researchers could operationalize teacher quality in terms of students’ performance on end-of-grade tests. Clearly, both of these operationalizations have weaknesses. People might report their navigation abilities as better than they are, and many factors other than the quality of instruction are likely to impact end-of-grade test scores.
Similar problems emerge in AI systems. For example, some AI tools purport to assess candidates’ suitability for jobs, but in reality, may be screening out applicants with disabilities. One of the benefits of being explicit about the definition of a construct and its operationalization is that practitioners can better identify problems with assumptions implicit in their measurement methods (e.g., that people will not lie about or overestimate their abilities). By clearly surfacing these assumptions, practitioners can be much more careful when interpreting measurement results and identify whether and how they may need to bolster their conclusions with other evidence.
Problems with how constructs are operationalized into measurements often come down to issues with validity and reliability. Validity refers to how well a measurement captures the construct it is intended to capture. Reliability refers to how consistently a measurement can capture a construct. These two notions are foundational in the social sciences. Validity and reliability assessments help researchers determine whether they need to include additional measures of their constructs to more adequately capture them or whether they need to choose a different measure altogether. In the case of teacher quality, education researchers are likely not only to measure end-of-grade test scores but also to account for other factors such as students’ test scores from the previous year, students’ teacher evaluations, and administrator and parent support for teachers. They also may choose to exclude measures if their reliability is known to be poor. For example, to estimate the reliability of end-of-grade tests themselves, education researchers might give the same students very similar forms of the test. If students perform well on one form and poorly on the other, this could indicate that test scores are not a reliable measure of teacher quality.
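For readers who find it easier to see this in code, here is a minimal sketch of that parallel-forms idea, using invented scores purely for illustration: reliability is estimated as the correlation between the same students’ scores on two similar forms of a test.

```python
from statistics import correlation  # requires Python 3.10+

# Hypothetical scores for the same ten students on two similar test forms.
form_a = [78, 85, 62, 90, 74, 88, 69, 95, 81, 70]
form_b = [75, 88, 60, 93, 70, 85, 72, 92, 84, 68]

# Parallel-forms reliability: how strongly the two sets of scores agree.
# Values near 1.0 suggest the test measures something consistently;
# values near 0 suggest the scores are too noisy to lean on.
reliability = correlation(form_a, form_b)
print(f"Parallel-forms reliability estimate: {reliability:.2f}")
```

A low value would be a reason to weight test scores less heavily, or drop them entirely, when constructing a measure of teacher quality.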
Yet AI researchers rarely evaluate the reliability or validity of their measurements, which is a significant issue for the field. For example, in papers assessing fairness and bias in AI systems, researchers have often failed to articulate what constructs they were attempting to measure and have frequently included different conceptualizations of the same construct within a single measurement. Others have failed to employ basic quality control in their measurement instruments, such as using language evaluation tools riddled with typos and misspellings. And in a survey of evaluation practices in AI research papers, few papers assessed how well models would perform on datasets that were very different from the ones they were trained on. This type of evaluation is important because the contexts where models are deployed (e.g., educational software) can vary significantly from the contexts where training data were collected (e.g., Reddit).
Fortunately, parts of the AI research community have begun calling for AI developers and researchers to look to the example of measurement approaches employed in the social sciences. While these approaches are not perfect, they grapple far more explicitly with whether measurements truly capture relevant information than comparable efforts in AI typically do.
Although some measurement techniques in the social sciences are complex, spotting common-sense gaps can be straightforward. For example, developers often evaluate AI systems by seeing how well they can perform particular tasks using what is known as a “benchmark” dataset. A benchmark dataset might contain questions with multiple-choice answers, and developers may evaluate an AI system based on how often it identifies the correct answer from the choices. To have confidence that a benchmark dataset is truly able to measure a model’s underlying capability, we would expect the AI’s performance to remain relatively stable even if question formats are slightly altered, such as by shuffling the order of the answer choices. Unfortunately, many language models show notably different performance in response to minor changes in prompts, which suggests that these sorts of capability assessments are not reliable.
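To make that check concrete, here is a minimal sketch, assuming a hypothetical `model_answer` function and `benchmark` dataset standing in for whatever system and test set are being evaluated: it scores the same multiple-choice questions twice, once with the choices in their original order and once shuffled, and compares the two accuracies.

```python
import random

def evaluate(model_answer, questions, shuffle_choices=False, seed=0):
    """Return a model's accuracy on multiple-choice questions.

    `model_answer(question, choices)` is assumed to return the text of the
    choice the model picks; each item in `questions` is a dict with
    "question", "choices", and "answer" keys.
    """
    rng = random.Random(seed)
    correct = 0
    for item in questions:
        choices = list(item["choices"])
        if shuffle_choices:
            rng.shuffle(choices)  # same content, different presentation
        if model_answer(item["question"], choices) == item["answer"]:
            correct += 1
    return correct / len(questions)

# A reliable capability measure should give similar numbers either way:
# acc_original = evaluate(model_answer, benchmark)
# acc_shuffled = evaluate(model_answer, benchmark, shuffle_choices=True)
# print(f"Gap from shuffling choices: {abs(acc_original - acc_shuffled):.2f}")
```

A large gap between the two numbers says more about the brittleness of the measurement than about the model’s underlying capability.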
When it comes to validity, we expect measures of the same or similar constructs to be more closely associated with one another than with measures of very different constructs. For example, researchers may aim to assess how well models perform on mathematical reasoning using several benchmark datasets. For a measurement to be valid, we would expect a model to perform similarly across these datasets. If the model shows very different performance, this could indicate a problem with the measures themselves or with how the construct of mathematical reasoning was initially conceptualized. By assessing validity, developers can gain insight into where models or measurements have gone awry, which can help them diagnose and solve problems.
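As a minimal sketch of that logic, with accuracy numbers and the threshold invented purely for illustration, one can compare a single model’s scores across several benchmarks that are all supposed to capture mathematical reasoning and flag an unusually wide spread:

```python
# Hypothetical accuracies for one model on three benchmarks that are all
# meant to operationalize the construct of "mathematical reasoning."
scores = {"benchmark_a": 0.71, "benchmark_b": 0.68, "benchmark_c": 0.39}

spread = max(scores.values()) - min(scores.values())
print(f"Score spread across benchmarks: {spread:.2f}")

# An illustrative threshold, not a standard: a wide spread is a prompt to
# investigate whether one benchmark is a poor measure or whether the
# construct was conceptualized too broadly to be captured by a single number.
if spread > 0.15:
    print("Warning: these benchmarks may not be measuring the same construct.")
```

The point of such a check is not the specific cutoff but the habit of asking whether supposedly interchangeable measures actually agree.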
Reliable and valid evaluation and testing are the linchpin of myriad other policy goals, not least those the U.S. AI Safety Institute has been tasked with advancing. As policymakers and practitioners delve into the challenges of ensuring that AI is deployed responsibly and safely, developers and researchers must prioritize rigorous evaluation methods to inform these efforts. Without them, the policy interventions many hope will mitigate AI risks could end up sitting on shaky ground.