
AI Policy & Governance, CDT AI Governance Lab

Hypothesis Testing for AI Audits

Introduction

AI systems are used in a range of settings, from low-stakes scenarios like recommending movies based on a user’s viewing history to high-stakes areas such as employment, healthcare, finance, and autonomous vehicles. These systems can offer a variety of benefits, but they do not always behave as intended. For instance, ChatGPT has demonstrated bias against resumes of individuals with disabilities,[1] raising concerns that if such tools are used for candidate screening, they could worsen existing inequalities. Recognizing these risks, researchers, policymakers, and technology companies increasingly emphasize the importance of rigorous evaluation and assessment of AI systems. These efforts are critical for developing responsible AI, preventing the deployment of potentially harmful systems, and ensuring ongoing monitoring of their behavior post-deployment.[2]

As laid out in today’s new paper from CDT, “Assessing AI: Surveying the Spectrum of Approaches to Understanding and Auditing AI Systems,” organizations can use a variety of assessment techniques to understand and manage the risks and benefits of their AI systems.[3] Some assessments take a broad approach, identifying the range of potential harms and benefits that could arise from an AI system. For example, an AI company might engage with different stakeholders who may be affected by the system to explore both the positive and negative impacts it could have on their lives. Other assessments are more focused, such as those aimed at validating specific claims about how the AI system performs. For example, a company developing a hiring algorithm may want to verify whether the algorithm recommends qualified male and female candidates at the same rate.

Stakeholders have noted the importance of evaluating specific claims about AI systems through what are often referred to as AI audits. Researchers have drawn comparisons between AI audits and hypothesis testing in scientific research,[4] where scientists determine whether the effects observed in an experiment are likely meaningful or simply due to random chance. Similarly, hypothesis testing offers AI auditors a systematic approach to assess patterns in AI system behavior. This method can help gauge the evidence supporting a claim, such as whether an AI system indeed avoids discrimination against particular demographic groups.

Using hypothesis testing in AI audits offers several advantages. It is a well-established method in empirical research, so it provides AI auditors with familiar tools for evaluation and interpretation. And hypothesis testing helps auditors quantify the uncertainty in their data, which is crucial for making informed decisions and developing action plans. However, like in other fields, hypothesis testing in AI has its limitations. Results can be influenced by factors other than the specific effects being evaluated, such as the particular subsets of data used to assess a claim. These sources of random error can impact the validity of the interpretations drawn from hypothesis tests. Therefore, accounting for these limitations is essential for auditors to make appropriate recommendations based on their analyses.

In this explainer, intended as a supplement to our broader paper on AI assessments, we focus in particular on the key ideas behind hypothesis testing, show how it can be applied to AI audits, and discuss where it might fall short. To help illustrate these points, we use computational simulations of a hypothetical hiring algorithm to show how hypothesis testing can detect gender disparities under different conditions.[5]

Hypothesis Testing in Statistics

Imagine a technology company is developing an algorithm that evaluates job applicants’ resumes, cover letters, and questionnaire responses to make hiring decisions. If the algorithm is trained on historical data from past applicants and hires, it might unintentionally learn existing biases or disparities, potentially leading to unfair hiring patterns. For example, an algorithm used to hire software engineers could end up disadvantaging female applicants due to the historical underrepresentation of women in technical fields. In such cases, an auditor may want to assess whether the algorithm recommends qualified male and female candidates at equal rates.

To conduct such a test, the auditor could use hypothesis testing, starting with a “null hypothesis” (H0), which usually represents the assumption that there is no difference between groups. The “alternative hypothesis” (H1) proposes that a difference does exist. In statistical terms, the hypotheses might look like this:

  • Null hypothesis / H0: There is no difference in the algorithm’s hiring recommendations for men and women.
  • Alternative hypothesis / H1: There is a difference in the algorithm’s hiring recommendations for men and women.


It might seem counterintuitive to set the default assumption as “no difference” if the auditor is investigating gender disparities. However, in hypothesis testing, the burden of proof lies on the evaluator to show that any observed effect (such as outcome disparities) is unlikely to have occurred by random chance. Later in this explainer, we will discuss how to use hypothesis testing when the null hypothesis assumes there is a difference.

When researchers conduct experiments, their goal is to understand how patterns or relationships appear in a broader group, or “population,” they are studying. Instead of gathering data from the entire population, though, they typically rely on smaller subsets of data called “samples,” for a variety of reasons. For example, an auditor evaluating a hiring algorithm might not have access to data from every potential candidate, or it could be too time-consuming and costly to gather all this data. Instead, the auditor would analyze a smaller sample, assessing the level of disparity within that sample as a way to estimate the level of disparity in the population.

But results from a sample may not accurately represent the broader population due to random sampling variability. An auditor might, by chance, select for the sample a subset of men whom the algorithm is less likely to recommend compared to the overall population, or a sample of women whom the algorithm is more likely to favor. In other words, sample measurements are subject to some degree of random error. This is where hypothesis testing becomes essential—it allows researchers to evaluate whether the effects they observe in measurements of the sample are likely indicative of real patterns in the larger population, or if they could simply be due to random chance within that specific sample.

When conducting a hypothesis test, researchers need a way to decide whether to reject the null hypothesis, which means determining if there is enough evidence to conclude that an effect observed in a sample likely reflects a true effect in the population. This decision hinges on statistical significance, which indicates whether an observed effect in a sample is meaningful enough statistically to suggest it likely exists in the population. To assess statistical significance, researchers calculate a p-value, which represents the probability of obtaining results at least as extreme as those observed if the null hypothesis were true.

The basic idea is that a strong effect observed in the sample — such as an algorithm recommending men for hire much more often than women — is less likely to be the product of random variation than a weaker effect would be. A common threshold for determining whether an effect is significant is a p-value of 0.05, which implies a 5% risk of concluding there is an effect when, in fact, there isn’t. If the p-value is less than 0.05, researchers conclude that the effect is statistically significant.
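To make this concrete, here is a minimal sketch of how such a test could be run in Python with the statsmodels library (also mentioned in footnote 10). The counts of recommended candidates are hypothetical, chosen purely for illustration.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical audit sample: the algorithm recommended 63 of 100 men
# and 45 of 100 women for hire.
counts = [63, 45]    # number recommended in each group
nobs = [100, 100]    # sample size in each group

# Two-sided test of H0: men and women are recommended at the same rate.
z_stat, p_value = proportions_ztest(counts, nobs, alternative='two-sided')

print(p_value)           # about 0.01 for these counts
print(p_value < 0.05)    # True, so we would reject H0 at the 5% level
```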

Random chance can influence the patterns researchers observe in a sample, leading to possible errors in researchers’ conclusions about the population. These errors fall into two categories. A Type I error occurs when investigators incorrectly conclude that there is an effect when, in reality, there is none; this is also known as a “false positive.” For example, if an auditor concluded that an algorithm demonstrates gender disparity based on the sample data when it actually does not, this would be a Type I error. Conversely, a Type II error happens when researchers fail to detect a real effect, resulting in a “false negative.” In this case, the auditor would fail to identify that the algorithm results in disparities, potentially allowing a discriminatory AI system to be deployed.

  • Type I error / false positive: Researchers or auditors incorrectly conclude there is an effect when there is none.
  • Type II error / false negative: Researchers or auditors fail to detect a real effect.

The probability of committing a Type II error is closely related to statistical power — the likelihood that a hypothesis test will correctly identify an effect when it exists.  Factors like sample size and the magnitude of the effect (e.g., the level of disparity driven by the algorithm) directly impact statistical power, emphasizing the need for careful planning in AI audits. Since low statistical power increases the risk of Type II errors or false negatives, auditors of AI systems will need sufficient statistical power to detect the effect they are investigating, such as gender disparity in an algorithm. 

Another factor that impacts statistical power is whether the hypothesis test evaluates differences only in one direction or in both directions. Imagine an auditor is particularly interested in whether the algorithm unfairly advantages male applicants. That is, even when men and women are equally qualified, it recommends men more frequently than women. Instead of testing for any difference between men and women, she specifically wants to investigate the level of evidence supporting bias against women. In this case, the alternative hypothesis focuses on men being selected more often than women, rather than looking for differences in either direction. This is known as a “one-sided” test in statistics, which is suitable when the goal is to investigate a specific outcome, such as a bias in favor of men. So here, the auditor’s null and alternative hypotheses would be:

  • Null hypothesis / H0: There is no difference in the hiring rates of men and women.  
  • Alternative hypothesis / H1: Men have a higher hiring rate than women.

This example illustrates an important aspect of hypothesis testing: how the hypotheses are framed directly impacts the interpretation of the results. Here, the auditor is testing specifically whether men are favored over women, not whether there is bias in favor of either men or women. As a consequence, if the algorithm instead favored women, this hypothesis formulation would not create the conditions for the auditor to detect that disparity because the test wasn’t designed to look for it.

It may seem counterintuitive to evaluate for differences (in this example, bias) only in one direction; however, in statistics, there are tradeoffs. In the case of one-sided tests, the tradeoff involves statistical power. A one-sided test has greater statistical power compared to a two-sided test because it focuses entirely on one direction of the effect. Imagine shining a flashlight to look for your keys in the dark. A one-sided test is like focusing all the light in one direction, making it easier to spot the keys in that direction, but impossible to find them if they are in the opposite direction. A two-sided test splits the light to cover both directions, making it possible to search in both directions, but harder to see than if all of the light were in one spot. In many cases, it would make sense to have less light to cover more ground; however, if you have reason to expect that the keys would be to the right and not to the left, it makes more sense to focus on that area.

In our example, using a one-sided hypothesis makes the test more sensitive to detecting differences in the specified direction, which, in the auditor’s case, is a bias in favor of men. However, the tradeoff is that a one-sided test does not attempt to assess bias in the opposite direction.[6] Therefore, the decision to use a one-sided or two-sided test depends on the research question or audit objective, the stakes of evaluating bias in the opposite direction, and the context in which the results will be interpreted and acted upon.
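In statistical software, the choice between a two-sided and a one-sided test is often just a parameter. The sketch below, using the same hypothetical counts as before, shows how the two versions of the test could be run with statsmodels; it is illustrative rather than a prescription for how any particular audit must be run.

```python
from statsmodels.stats.proportion import proportions_ztest

counts, nobs = [63, 45], [100, 100]   # hypothetical sample: [men, women]

# Two-sided: H1 is that the selection rates differ in either direction.
_, p_two_sided = proportions_ztest(counts, nobs, alternative='two-sided')

# One-sided: H1 is that men's selection rate is higher than women's.
_, p_one_sided = proportions_ztest(counts, nobs, alternative='larger')

print(p_two_sided, p_one_sided)   # the one-sided p-value is half the two-sided one
```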

Applying hypothesis testing to AI audits: A simulation approach

The main purpose of hypothesis testing is to examine patterns within a sample to make inferences about a larger population. In a real-world AI system audit, an auditor usually cannot assess the algorithm’s performance across the entire population. Auditors may face limitations in how much data they can analyze due to privacy concerns, regulations, resource constraints, or restrictions from the organization, making analyzing data within samples the only viable option. Or auditors may be hoping to understand not only how the algorithm performed with existing data, but how it might perform with future users, whose data would not be available to them. If full population data were available, hypothesis testing wouldn’t be necessary, as direct evaluation would provide the needed insights.

Explaining how both population dynamics and sample variability affect hypothesis testing can be challenging, especially since complete data on the population often doesn’t exist. In this explainer, we’ll rely on simulation — a computational approach that enables us to control population characteristics to explore their impact on hypothesis tests applied to samples — to illustrate these concepts. Our simulation will create a virtual population to model the algorithm’s hiring decisions, setting specific parameters, like the algorithm’s level of gender bias. This method allows us to illustrate hypothesis testing in a controlled setting, showing how factors such as sample size or the strength of gender bias in the algorithm can affect audit results.

To illustrate how hypothesis testing could be used in AI audits, we will simulate a population in which 5,000 men and 5,000 women are eligible to apply for a job. Let’s assume that 60% of both male and female applicants are qualified, and that the auditor is interested in whether the algorithm exhibits demographic parity, meaning that it makes hiring recommendations at the same rate for different demographic groups. In our simulation, achieving demographic parity would mean recommending men and women for hire at the same rate.

To illustrate the importance of hypothesis testing in AI auditing, we will simulate an algorithm that we know fails to achieve demographic parity. Figure 1 shows how the algorithm would make hiring recommendations for male and female candidates in the entire population of candidates. Among qualified male applicants, the algorithm recommends hiring 80% of them, while it recommends only 60% of qualified female applicants. For unqualified applicants, the algorithm recommends hiring 20% of men but only 10% of women. Despite men and women being equally qualified, the algorithm ends up recommending women for hire 39% of the time, compared to 56% for men (a difference of about 0.17). For an auditor who has chosen demographic parity as the relevant fairness definition, these recommendations indicate an unfair system. However, because the auditor would not have access to data on the entire population, the goal of her hypothesis testing would be to try to uncover these patterns in a smaller sample of the data.

Four pie charts, showing the proportions of applicants in the population that the algorithm would recommend, based on qualifications and gender. Qualified men: 80% hired, 20% not hired. Qualified women: 60% hired, 40% not hired. Unqualified men: 20% hired, 80% not hired. Unqualified women: 10% hired, 90% not hired.

Figure 1. Proportions of applicants in the population that the algorithm would recommend, based on qualifications and gender.
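Below is a minimal sketch of how such a population could be simulated in Python. The function name, random seed, and code structure are our own illustrative choices; the companion notebook linked in footnote 5 contains the actual simulation code and may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed for reproducibility
N = 5_000                        # men and women each in the simulated population

def build_group(n, p_qualified, p_hire_qualified, p_hire_unqualified, rng):
    """Simulate the algorithm's hiring recommendations for one group."""
    qualified = rng.random(n) < p_qualified
    p_hire = np.where(qualified, p_hire_qualified, p_hire_unqualified)
    return rng.random(n) < p_hire   # True = recommended for hire

# Scenario parameters: 60% of each group is qualified; the algorithm recommends
# 80% of qualified men vs. 60% of qualified women, and 20% vs. 10% of unqualified ones.
men_recommended = build_group(N, 0.60, 0.80, 0.20, rng)
women_recommended = build_group(N, 0.60, 0.60, 0.10, rng)

print(men_recommended.mean(), women_recommended.mean())   # roughly 0.56 and 0.39-0.40
```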

The auditor would need to decide whether to reject the null hypothesis by assessing the probability that any observed gender disparity in the sample occurred due to random chance. In other words, she would use her sample to determine if the algorithm demonstrates gender bias. However, due to random sampling error, the sample results might differ from the overall population, which would cause the outcome she observes to vary depending on which individuals are included in the sample.

Computational simulation can help demonstrate how sampling error can cause estimates of gender disparities to differ from the true population values. Starting with our simulated population of 5,000 men and 5,000 women, we randomly select 100 men and 100 women into the sample to observe the algorithm’s hiring recommendations. In this particular sample, the algorithm recommends hiring men 63% of the time and women 45% of the time (a difference of 0.18) — an estimate that shows slightly more gender disparity than what we know from the simulation scenario to be the true rates in the overall population. However, if we had chosen different samples, the estimates could have varied.

To understand how much sample estimates are likely to differ from the population, the simulation allows us to repeat this process multiple times and visualize variability in the results we observe. Figure 2 summarizes the results, with the x-axis representing the selection rates and the y-axis showing the frequency of each rate across the 100 simulations. The dashed lines indicate the selection rates in the entire population.
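Continuing the sketch above, drawing repeated samples and recording each sample’s selection rates might look like the following; the helper function and loop are again illustrative rather than the companion notebook’s exact code.

```python
def sample_rates(men, women, n, rng):
    """Randomly sample n men and n women and return their selection rates."""
    m = rng.choice(men, size=n, replace=False)
    w = rng.choice(women, size=n, replace=False)
    return m.mean(), w.mean()

# 100 simulated audits, each with samples of 100 men and 100 women.
rates = [sample_rates(men_recommended, women_recommended, 100, rng) for _ in range(100)]

# These lists would feed histograms like Figures 2 and 3.
men_rates = [r[0] for r in rates]
women_rates = [r[1] for r in rates]
```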

Graphic of an orange and blue histogram, showing the overlap in the distribution of the algorithm’s selection rates for men and women across 100 simulated samples. Men population rate: 0.56. Women population rate: 0.39.

Figure 2. Frequencies of the algorithm’s selection rates for men and women across 100 simulated samples.

The graph reveals that while most selection rates in the simulations cluster around the true population rates, there is still some variability. For example, in one simulated sample, the selection rate for women was as low as 27%, while in another, it reached 54%, almost matching the population rate for men. The graph demonstrates that while in general, results in a sample will tend to resemble the population, that is not necessarily the case in any given sample, which could lead to misleading audit results.

  • Statistical result: Across 100 simulations, each with 100 randomly selected men and women, the selection rate for women varied between 27% and 54%, while for men, it ranged from 45% to 67%. This variability in sample estimates, caused by sampling error, led to deviations from the true selection rates of the whole population.
  • Interpretation: The estimated disparities resulting from the algorithm in a given sample may appear more or less severe than they actually are in the population, depending on which specific individuals are randomly selected.

Another way to visualize these experimental results is by using a histogram that shows the difference in selection rates between men and women for each experiment, as seen in Figure 3. The dashed line marks the selection rate difference in the entire population, which is approximately 0.17.

Graphic of a blue histogram, showing the frequencies of the algorithm’s selection rate difference for men and women across 100 simulated samples. Population selection rate difference: 0.17.

Figure 3. Frequencies of the algorithm’s selection rate difference for men and women across 100 simulated samples.

While most outcomes apparent from samples cluster around this population difference, some indicate a much larger divergence. In one experiment, women were even selected more frequently than men, demonstrating the opposite effect as the trend observed in the overall population.[7] This random variation highlights a key challenge in hypothesis testing: determining whether the observed results indicate a genuine effect in the population or are simply due to chance. When auditors assess an algorithm’s behavior based on a sample, they must acknowledge that random error can influence their estimates. Therefore, auditors should interpret their findings cautiously, framing their conclusions with an awareness of sampling variability and its limitations.

Instead of just examining the selection rates, our simulation allows us to run a formal statistical test on each of our simulated samples and calculate a p-value, just as an auditor would in a real audit. The p-value indicates the likelihood of obtaining the observed results—or even more extreme ones—purely by chance if there were actually no difference between the algorithm’s treatment of men and women in the overall population.
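Building on the earlier sketch, the step below computes a p-value for each simulated sample using a one-sided two-proportion z-test from statsmodels, in line with the one-sided hypotheses discussed above. The companion notebook may use a different test, so treat this as one possible implementation.

```python
from statsmodels.stats.proportion import proportions_ztest

def sample_pvalue(men, women, n, rng):
    """Sample n men and n women and test for a gender gap in selection rates."""
    m = rng.choice(men, size=n, replace=False)
    w = rng.choice(women, size=n, replace=False)
    # H0: equal selection rates; H1: men's rate is higher than women's.
    _, p = proportions_ztest([m.sum(), w.sum()], [n, n], alternative='larger')
    return p

pvals = np.array([sample_pvalue(men_recommended, women_recommended, 100, rng)
                  for _ in range(100)])
print((pvals < 0.05).mean())   # share of simulated audits that reject H0
```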

We can visualize the p-values from the 100 experiments in a histogram, as shown in Figure 4.

Graphic of a blue histogram, showing the frequencies of p-values in simulated sample tests.

Figure 4. Frequencies of p-values in simulated sample tests.

Using a significance level of p < 0.05, we would correctly reject the null hypothesis of no difference in selection rates in 78 out of the 100 experiments. In other words, 78% of the time, we would correctly conclude that it is unlikely that the disparity in our sample emerged by chance alone. However, in 22 experiments, the evidence was not strong enough to reject the null hypothesis, resulting in a Type II error (false negative).

This means that the auditor could fail to identify a statistically significant disparity in the algorithm due to the random selection of individuals in the sample, even though the algorithm does in fact produce disparate recommendations. In our simulation, this failure to detect the disparity would occur more than 20% of the time!

  • Statistical result: Across 100 simulations, each with random samples of 100 men and 100 women, and using a significance level of p < 0.05, the statistical test found a significant difference in 78% of the cases. However, in 22% of the samples, the test failed to reject the null hypothesis, incorrectly suggesting no difference in selection rates in the population (a Type II error).
  • Interpretation: In 78 out of 100 tests, the statistical test correctly identified that the algorithm leads to disparities. However, in 22 cases, it failed to detect the true difference. In the context of auditing an algorithm that does result in gender disparity, this would mean that the auditor would correctly conclude that the algorithm recommends women less frequently than men 78% of the time. But 22% of the time, the auditor would find insufficient evidence to conclude that the algorithm led to disparities.

Drawing erroneous conclusions over 20% of the time is clearly not ideal. Fortunately, there are ways to improve the robustness of these conclusions. One method is to collect data from larger samples of men and women. As sample size increases, the samples tend to more closely resemble the overall population, reducing the likelihood of random error that causes large deviations in the sample outcomes.

However, while larger sample sizes lower the risk of statistical errors, gathering them is not always feasible. Smaller samples are often quicker and more cost-effective to collect, especially when data collection is time-consuming, expensive, or logistically difficult. As with analyzing data on entire populations, auditors may not be able to access large samples due to organizational restrictions or resource and operational constraints. Yet, relying on samples that are too small can lead to inaccurate conclusions, potentially resulting in the deployment of systems that cause real-world harm. Therefore, auditors must balance efficiency and accuracy, aiming to gather the minimum sample size necessary to reliably detect the patterns they are investigating.

We can use our simulation to illustrate how sample size affects the accuracy of statistical conclusions. Instead of gathering data from 100 men and 100 women, we can increase the sample size to 250 in each group and repeat the experiment 100 times. The results are shown in Figure 5. 

Graphic of an orange and blue histogram, showing the distribution of frequencies of the algorithm’s selection rates for men and women across 100 simulated samples, with sample sizes of 250 men and women. Men population rate: 0.56. Women population rate: 0.39.

Figure 5. Frequencies of the algorithm’s selection rates for men and women across 100 simulated samples with sample sizes of 250 men and women.

With this larger sample size, fewer experiments produced selection rates that deviated substantially from the population rates. Correspondingly, the statistical tests yielded p-values below 0.05 in 97 out of 100 experiments. This demonstrates that increasing the sample size substantially reduces the chances of making incorrect statistical conclusions about the population.
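In the illustrative sketch above, this change amounts to swapping a single parameter, reusing the hypothetical sample_pvalue helper defined earlier:

```python
# Same simulated audits as before, but with 250 men and 250 women per sample.
pvals_250 = np.array([sample_pvalue(men_recommended, women_recommended, 250, rng)
                      for _ in range(100)])
print((pvals_250 < 0.05).mean())   # a much larger share of rejections than with n = 100
```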

Sample size is not the only factor influencing an auditor’s ability to detect a real difference in a population from a sample; the magnitude of the difference also plays a key role. Consider a hypothetical population where the gender disparity is more pronounced than in our initial scenario. This time, we will set the difference in selection rates between men and women in the simulated scenario to be larger—0.28 instead of 0.17. In this scenario, even with only 100 participants in each group, we would correctly reject the null hypothesis of no difference in selection rates 100 out of 100 times.[8] In other words, the larger the effect size in the population, the easier it is to detect in a sample, even if the sample size is smaller.
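The explainer does not specify how the larger 0.28 gap is produced; one way to create it in our illustrative sketch is to lower the recommendation rate for qualified women from 60% to 40% (a hypothetical choice) and rerun the same test:

```python
# Hypothetical parameters yielding a population gap of about 0.56 - 0.28 = 0.28:
# qualified women are recommended 40% of the time instead of 60%.
women_recommended_wide = build_group(N, 0.60, 0.40, 0.10, rng)

pvals_wide = np.array([sample_pvalue(men_recommended, women_recommended_wide, 100, rng)
                       for _ in range(100)])
print((pvals_wide < 0.05).mean())   # at or very near 1.0 with this larger gap
```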


In some applications, like hiring, where disparity testing is already common, auditors may use existing laws or norms to set specific thresholds for unacceptable disparities. For instance, employment discrimination jurisprudence leans on the “four-fifths” or “80%” rule of thumb that suggests that further investigation is warranted if the selection rate for any protected group (e.g., based on race or gender) falls below 80% of the rate for the group with the highest selection rate. Although the four-fifths rule is often misapplied in AI contexts,[9] it can still serve as a relevant threshold in certain situations. However, simply falling below 80% in a given sample may not be sufficient evidence to conclude that the disadvantaged group’s outcomes are below 80% of those of the more advantaged group in the entire population. In these cases, auditors will still need to use specific statistical tests to challenge the null hypothesis that the disadvantaged group receives at least 80% of the positive outcomes compared to the more advantaged group. When testing against specific thresholds rather than looking for whether there is any difference, auditors will still need to account for the possibility that random error could explain the observed differences in a sample.

In empirical research, scientists often use a method called power analysis to estimate how large a sample needs to be to detect an effect of a certain size. Our simulation functions similarly to a power analysis: by adjusting assumptions about the size of the difference in the population and experimenting with sample sizes, we can determine how large groups need to be to achieve an acceptable margin of error.[10] However, while power analysis is a useful tool, it does not guarantee that a true effect will be found, even if the sample size matches the recommended value. Moreover, if the auditor is uncertain about the expected effect size, choosing the appropriate sample size becomes more challenging. 
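For a simple two-group comparison like ours, a power analysis of the kind footnote 10 mentions could be sketched with statsmodels as follows. It assumes the population selection rates from our scenario (roughly 0.56 and 0.39) and a one-sided test; with different assumptions, the required sample size would change.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Standardized effect size (Cohen's h) for the assumed population rates.
effect = proportion_effectsize(0.56, 0.39)

# Per-group sample size needed for 80% power at alpha = 0.05, one-sided test.
n_per_group = NormalIndPower().solve_power(effect_size=effect,
                                           alpha=0.05,
                                           power=0.80,
                                           ratio=1.0,
                                           alternative='larger')
print(round(n_per_group))   # on the order of 105 per group under these assumptions
```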

Practitioners using hypothesis testing in auditing should be aware of its limitations, including the fact that it doesn’t ensure meaningful effects will be detected in a sample when they exist in the population. To reduce the risk of drawing inaccurate conclusions about the population, auditors should ensure their sample is large enough to detect the anticipated effect size. If they are unsure of the expected effect magnitude, they should base their sample size estimates on the smallest effect size they would consider meaningful.

To someone less familiar with statistics, it might seem reasonable to try another approach: if the first test doesn’t show a significant result, the auditor could simply repeat the test with different samples or continue collecting data until the test yields a significant outcome. However, this approach greatly increases the risk of a false positive. Just as sampling error can sometimes hide real effects, it can also exaggerate them. The more tests the auditor runs, the higher the chance of finding a statistically significant result purely by chance, leading to a Type I error—incorrectly concluding that there is a meaningful effect in the population when there actually isn’t one.

This practice is known as p-hacking, where researchers manipulate their analyses or repeat tests until they achieve statistically significant results. P-hacking undermines the validity of findings, as it exploits random fluctuations in data rather than revealing true effects. Auditors must avoid this pitfall by defining their sample sizes and analysis plans before collecting any data. This improves the likelihood that any conclusions drawn from the sample are reliable.

In more complex or repeated audits, the auditor may need to conduct multiple statistical tests. For instance, they might test whether the algorithm results in disparities overall, as well as explore differences between intersectional subgroups, such as black women versus white women, or black women versus white men. Or the auditor might want to perform analyses at different points in time.

In these situations, auditors should use techniques known as multiple comparisons correction to reduce the risk of drawing conclusions based on false positives. This approach adjusts the threshold for statistical significance depending on the number of tests the auditor conducts. For example, if the auditor performs 10 tests, a multiple comparisons correction might lower the p-value threshold from 0.05 to a more stringent level (e.g., 0.005). Essentially, multiple comparisons correction demands stronger evidence before concluding that an effect exists, thereby reducing the risk of Type I errors and inaccurate findings.
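As an illustration, a Bonferroni correction for ten hypothetical p-values could be applied with statsmodels as follows; the p-values are made up for demonstration, and other correction methods exist.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from ten subgroup comparisons in a single audit.
p_values = [0.004, 0.03, 0.20, 0.01, 0.47, 0.06, 0.008, 0.09, 0.65, 0.04]

# Bonferroni correction: each test is effectively held to a 0.05 / 10 = 0.005 threshold.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')

print(reject)       # which comparisons remain significant after correction
print(p_adjusted)   # p-values adjusted for the number of tests
```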

Using simulations to demonstrate a lack of gender disparity in an algorithmic system

So far, we’ve discussed how auditors can use hypothesis testing to assess systems for gender disparity, typically starting with the null hypothesis that no difference in selection rates exists in the population. However, in some cases, it might be more appropriate to reverse the hypotheses: Some scholars argue that when quantitative measures indicate a performance disparity, the burden of proof should be on the company to demonstrate that no disparity exists.[11] In other words, the default assumption should shift to the algorithm showing disparity.

In our auditing example, the null hypothesis could state that men are hired more often than women, while the alternative hypothesis would suggest no difference in hiring outcomes between the two groups. In this setup, the burden of proof falls on the auditor to show that, based on the sample, there is sufficient reason to believe that no disparity is present.

However, particularly in statistical testing, the absence of evidence is not the same as evidence of absence. For example, in our first simulated scenario, the selection rate difference between men and women was 0.17, but due to the small sample size, the hypothesis test failed to reach statistical significance in over 20% of the samples. In this case, the experiment simply lacked the power to detect the disparity. Companies seeking to avoid accountability might design audits with insufficient statistical power, ensuring that even if their systems show disparity, the test results would not allow auditors to confidently identify it.

Failing to reject a null hypothesis that there is no difference does not prove that no difference exists in the population. In statistics, it generally requires more evidence to conclude that an effect is absent than to suggest it is present. Consider this analogy: if someone is trying to determine whether a haystack contains any needles and only searches a portion of it without finding one, this doesn’t necessarily mean there are no needles. The more of the haystack they search, the more confident they can be that no needles are present.

It is statistically impossible to prove that two different populations are exactly the same in any particular respect, but it is possible to evaluate whether any difference between them is likely small enough to be considered acceptable.[12] This approach, known as “non-inferiority testing,” was developed by researchers in pharmacology. For example, pharmacology researchers might use non-inferiority testing to determine whether a generic drug is not significantly worse than the brand-name version. Another example is assessing whether a new drug, which may have fewer side effects or be easier for patients to take, isn’t significantly less effective than an existing drug that is more challenging to use.

In non-inferiority testing, researchers first define what they consider an acceptable difference. They then perform a statistical test to estimate, within a margin of error, how much one treatment might be worse than another in the overall population. If even the worst-case end of that estimate (the confidence bound indicating the greatest plausible inferiority) stays within the pre-defined acceptable difference, they can reject the null hypothesis that the difference is unacceptably large.

We can apply this concept to our simulation of an algorithmic hiring system. Returning to the scenario where the selection rate difference between men and women in the population is 0.17, we could introduce a threshold of 0.2. This means we want to assess how likely we are to correctly reject the null hypothesis that the difference between men’s and women’s selection rates is greater than 0.2.
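One way to run such a test in Python, again as an illustrative sketch rather than the companion notebook’s exact approach, is a one-sided two-proportion z-test with the threshold supplied as the hypothesized difference. The sample counts below are hypothetical.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical audit sample: 56 of 100 men and 39 of 100 women recommended.
counts, nobs = [56, 39], [100, 100]

# H0: the gap in selection rates (men minus women) is at least 0.2.
# H1: the gap is smaller than 0.2. Rejecting H0 would support the claim
# that any disparity stays below the 0.2 threshold.
_, p_value = proportions_ztest(counts, nobs, value=0.2, alternative='smaller')

print(p_value)   # well above 0.05 for this sample, so H0 is not rejected
```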

In the population, we know that the selection rate difference is less than 0.2. However, when we analyze the p-values from the non-inferiority tests (shown in Figure 6), the test was significant in only 10 out of 100 experiments. In the other 90 experiments, we would incorrectly fail to reject the null hypothesis. As a result, we would not be able to statistically conclude that the algorithm does not produce outcome disparities for women that exceed our threshold.

Graphic of a blue histogram, showing the frequencies of p-values in simulated sample non-inferiority tests. Significance: p < 0.05.

Figure 6. Frequencies of p-values in simulated sample non-inferiority tests.

We can attempt to rectify this problem, as before, by expanding our sample size to 250 men and 250 women. This raises the number of statistically significant samples to 12, but still means that in the majority of cases, the auditor’s experiment is not sufficiently powerful to reach the correct conclusion. Even with a sample size of 1,000 men and 1,000 women, the auditor would only be able to correctly reject the null hypothesis 39 times out of 100. In sum, in order to be sufficiently powered, an experiment testing these specific hypotheses would need to have a very large sample, one that would constitute a substantial proportion of the relevant populations.

Other factors that would affect this test would be the threshold for acceptable difference and the actual difference that exists in the population. If we set the threshold at 0.25 — well above the level of gender disparity in the population — with samples of 500 men and women, we would correctly reject the null hypothesis in 85 out of 100 experiments. 

Practitioners interested in auditing AI systems should bear in mind that, when performing non-inferiority testing, the smaller the difference is between their threshold and the difference in the population, the larger their sample will need to be. Practically, this may mean that audits leveraging non-inferiority testing will be more expensive or resource-intensive to conduct than audits relying on traditional hypothesis tests. In instances where sufficiently large samples cannot be analyzed, non-inferiority testing may not offer a viable approach.

Interpreting traditional hypothesis tests can be challenging, especially for those less familiar with statistics, and non-inferiority tests can be even more difficult to understand. However, auditors should be aware that if companies want to demonstrate that their systems do not show disparity beyond a certain level, they will need to rely on non-inferiority testing to support these claims. Simply failing to reject a traditional null hypothesis of no difference is not enough. Also, even when the null hypothesis in a non-inferiority test is rejected, the auditor cannot conclude that outcomes for the group of interest are not inferior at all — only that they are not more inferior than the specified threshold.[13] Statistical tests are used to evaluate precise claims. As such, the results of those tests must also be interpreted precisely.

***

Hypothesis testing can be a valuable tool in AI audits, offering a structured framework for assessing potential issues within AI systems. By using well-established statistical methods, auditors can evaluate the strength of the evidence supporting a hypothesis about how an AI system behaves while accounting for various sources of uncertainty.[14] These calibrated assessments enable auditors to draw informed conclusions and provide guidance to companies or third parties, which may include recommending remedies or, in some cases, catalyzing enforcement actions. However, hypothesis testing is not without its challenges — the same limitations that affect its use in empirical sciences also apply in AI auditing. Therefore, auditors and those interacting with them should approach these assessments with a solid understanding of statistical constraints, potential errors, and the practical aspects of data collection. This careful approach will allow for more accurate interpretations and the formulation of robust, evidence-based recommendations that promote the development and deployment of responsible AI systems. 


[1] Kate Glazko et al., “Identifying and Improving Disability Bias in GPT-Based Resume Screening,” in The 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24: The 2024 ACM Conference on Fairness, Accountability, and Transparency, Rio de Janeiro, Brazil: ACM, 2024), 687–700, https://doi.org/10.1145/3630106.3658933. [perma.cc/8ZBC-E3G2]

[2] Merlin Stein and Connor Dunlop, “Safe beyond Sale: Post-Deployment Monitoring of AI,” Ada Lovelace Institute (blog), June 28, 2024, https://www.adalovelaceinstitute.org/blog/post-deployment-monitoring-of-ai/. [perma.cc/4WV8-ZW3H]

[3] Miranda Bogen, “Assessing AI: Surveying the Spectrum of Approaches to Understanding and Auditing AI Systems” (Center for Democracy & Technology, 2025), https://cdt.org/insights/assessing-ai-surveying-the-spectrum-of-approaches-to-understanding-and-auditing-ai-systems/.

[4] Sarah H. Cen and Rohan Alur, “From Transparency to Accountability and Back: A Discussion of Access and Evidence in AI Auditing,” in Proceedings of the 4th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO ’24: Equity and Access in Algorithms, Mechanisms, and Optimization, San Luis Potosi, Mexico: ACM, 2024), 1–14, https://doi.org/10.1145/3689904.3694711. [perma.cc/8F57-EL95]

[5] Code to reproduce the simulation can be found at: https://github.com/amywinecoff/ml-teaching/blob/main/audit_simulation.ipynb. [perma.cc/7FTM-2Y9J]

[6] It’s important to note that in a one-sided test, evidence within a sample showing that the algorithm selected women more often would not be ignored. Instead, it would simply be interpreted as not providing support for the alternative hypothesis that men are recommended more frequently than women.

[7] Patterns in samples that contradict the expected effect direction cannot be assessed with a one-directional test. For this reason, researchers usually reserve one-directional tests for situations where there is a strong theoretical, empirical, or legal justification.

[8] This does not guarantee that an auditor would find an effect 100% of the time, merely that in our simulation, we correctly rejected the null hypothesis in 100 out of 100 samples. If we run the simulation 1,000 times, we correctly reject the null hypothesis 996 times. 

[9] Elizabeth Anne Watkins and Jiahao Chen, “The Four-Fifths Rule Is Not Disparate Impact: A Woeful Tale of Epistemic Trespassing in Algorithmic Fairness,” in The 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24: The 2024 ACM Conference on Fairness, Accountability, and Transparency, Rio de Janeiro, Brazil: ACM, 2024), 764–75, https://doi.org/10.1145/3630106.3658938. [perma.cc/YG66-2STY]

[10] For relatively straightforward statistical tests, software such as G*Power (https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower.html [https://perma.cc/4BC8-R3HA]) and code libraries such as Python’s statsmodels (https://www.statsmodels.org/ [perma.cc/8TEH-WUJK]) can offer sample size estimates. However, these packages may be less reliable for more complex experimental designs. 

[11] Arvind Narayanan, “The Limits of the Quantitative Approach to Discrimination,” 2022 James Baldwin Lecture (Princeton University, 2022), https://www.cs.princeton.edu/~arvindn/talks/baldwin-discrimination/. [perma.cc/ST5X-SWYQ]

[12] Daniël Lakens, “Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses,” Social Psychological and Personality Science 8, no. 4 (May 2017): 355–62, https://doi.org/10.1177/1948550617697177. [perma.cc/JT84-74FZ]

[13] Jennifer Schumi and Janet T. Wittes, “Through the Looking Glass: Understanding Non-Inferiority,” Trials 12, no. 1 (December 2011): 106, https://doi.org/10.1186/1745-6215-12-106. [perma.cc/E6FF-2NTN]

[14]  We note that hypothesis testing can be a valuable tool for evaluating statistical claims about system behavior; however, conclusions drawn from sample data do not necessarily imply intrinsic properties of the system. For example, if an auditor finds that an algorithm likely shows gender disparity in its recommendations, this does not necessarily indicate that the system is inherently “gender biased” in a more abstract sense. If the auditor chose different definitions of gender or gender disparity, or if the system were evaluated within a distinctly different population (e.g., the U.S. versus Japan), the hypothesis test might yield a different pattern of evidence.