
AI Policy & Governance, CDT AI Governance Lab

Adopting More Holistic Approaches to Assess the Impacts of AI Systems

by Evani Radiya-Dixit, CDT Summer Fellow

As artificial intelligence (AI) continues to advance and gain widespread adoption, the topic of how to hold developers and deployers accountable for the AI systems they implement remains pivotal. Assessments of the risks and impacts of AI systems tend to evaluate a system’s outcomes or performance through methods like auditing, red-teaming, benchmarking evaluations, and impact assessments. CDT’s new paper published today, “Assessing AI: Surveying the Spectrum of Approaches to Understanding and Auditing AI Systems,” provides a framework for understanding this wide range of assessment methods; this explainer on more holistic, “sociotechnical” approaches to AI impact assessment is intended as a supplement to that broader paper.

While some have focused primarily on narrow, technical tests to assess AI systems, academic researchers, civil society organizations, and government bodies have emphasized the need to consider broader social impacts in these assessments. As CDT has written about before, AI systems are not just technical tools––they are embedded in society through human relationships and social institutions. The OMB guidance on agency use of AI and the NIST AI Risk Management Framework recognize the importance of social context, including through policy mandates and recommendations for evaluating the impact of AI-powered products and services on safety and rights.

Many practitioners use the term “sociotechnical” to refer to these human and institutional dimensions that shape the use and impact of AI. Researchers at DeepMind and elsewhere have recommended frameworks that help envision what this more holistic approach to AI assessment can look like. These frameworks consider a few layers: First, assessments at the technical system layer focus on the technical components of an AI system, including the training data, model inputs, and model outputs. Some technical assessments can be conducted when the application or deployment context is not yet determined, such as with general-purpose systems like foundation models. But since the impact of an AI system can depend on factors like the context in which it is used and who uses it, evaluations focused on the human interaction layer consider the interplay between people and an AI system, such as how AI hiring tools transform the role of recruiters. And beyond this, an AI system can impact broader social systems like labor markets on a larger scale, requiring attention to the systemic impact layer. Assessments of the human interaction and systemic impact layers, in particular, require understanding the context in which an AI system is or will be deployed, and are critical for assessing systems built or customized for specific purposes. Importantly, these three layers are not neatly divided, and social impacts often intersect multiple layers.

To illustrate how these three layers can be applied in a tangible context, we consider the example of facial recognition. As a clearly rights-impacting form of AI, facial recognition usefully demonstrates how social context can be incorporated into technical assessments, while also highlighting the limitations of technical assessments in addressing broader societal impacts.

The Need for More Holistic Approaches

Current approaches for assessing the impacts of AI systems often focus on their technical components and rely on quantitative methods. For example, audits that evaluate the characteristics of datasets tend to use methods like measurement of incorrect data and ablation studies, which involve altering aspects of a dataset and measuring the results. Initial industry efforts towards more holistic approaches to assess AI’s impacts have often involved soliciting and crowdsourcing public input. For example, OpenAI initiated a bug bounty program and a feedback contest to better understand the risks and harms of ChatGPT. While these efforts help prevent technical assessments from being overly driven by internal considerations, they still raise questions about who is included, whether participants are meaningfully involved in decision-making processes, and whether broader harms like surveillance, censorship, and discrimination are being considered in the public feedback process.
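To make the idea of a dataset ablation mentioned above concrete, here is a minimal sketch with synthetic data and a simple classifier (the dataset, features, and ablated slice are all hypothetical, not drawn from any real audit): train a model with and without a slice of the data and compare the resulting metric.

```python
# Minimal sketch of a dataset ablation study: compare a model's accuracy when
# trained on the full dataset versus with one slice of the data removed.
# All data here is synthetic and for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)
slice_id = rng.integers(0, 2, size=2000)  # hypothetical slice label to ablate

X_tr, X_te, y_tr, y_te, s_tr, _ = train_test_split(
    X, y, slice_id, test_size=0.3, random_state=0
)

def fit_and_score(X_train, y_train):
    """Train a classifier and return its accuracy on the held-out test set."""
    model = LogisticRegression().fit(X_train, y_train)
    return accuracy_score(y_te, model.predict(X_te))

baseline = fit_and_score(X_tr, y_tr)
ablated = fit_and_score(X_tr[s_tr == 1], y_tr[s_tr == 1])  # slice 0 removed

print(f"Accuracy trained on the full dataset: {baseline:.3f}")
print(f"Accuracy with one data slice ablated: {ablated:.3f}")
```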

Given the limits of narrow evaluation and feedback methods, we emphasize the role of mixed methods––incorporating both qualitative and quantitative approaches––across different layers of assessment. While quantitative metrics can be useful for evaluating AI systems at scale, they risk oversimplifying and missing more nuanced notions of harm. In contrast, qualitative assessments can be more holistic, although they may require more resources.

Graphic of a table showing examples of quantitative vs. qualitative assessments.

As indicated in the table above, practitioners should actively consider social context across each layer and center marginalized communities most impacted by AI systems to ensure that assessments address the systemic inequities these communities face. These considerations can be strengthened through participatory methods that involve users and impacted communities in decision-making processes over how AI systems are evaluated.

To make these approaches actionable for practitioners, below we outline an array of methods to better assess and address the impacts of AI systems, along with examples of assessments that use these methods.

1. Incorporate social context and community input into evaluations of AI’s technical components

Evaluating an AI system requires not only analyzing its technical components but also examining its impact on people and broader social structures. Traditional assessments often narrowly evaluate impacts at the technical system layer, such as accuracy or algorithmic bias, relying on quantitative metrics pre-determined by researchers and practitioners. However, even when conducting a technical assessment, there are opportunities to consider the social dimensions of the technical components and decisions that shape the AI system.

By integrating context about social and historical structures of harm, researchers and practitioners can better identify which impacts to evaluate –– such as a more nuanced notion of bias –– and determine the appropriate quantitative or qualitative methods for assessing those impacts. In the case of facial recognition tools, understanding how social structures often privilege cisgender men can inform an analysis of how these tools operationalize gender in a cis-centric way, treating it as binary and tied to physical traits. While many quantitative analyses of facial recognition technology focus narrowly on comparing performance between cis men and cis women to assess gender bias, one study conducted a mixed methods assessment of how this technology performed on transgender individuals and their experiences with the technology. This example shows that more holistic perspectives can be integrated even in technical assessments.
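The quantitative side of such a mixed methods assessment might, in its simplest form, resemble the sketch below: computing accuracy disaggregated by self-identified gender rather than by a binary proxy. The records are invented for illustration and are not drawn from the cited study, and the qualitative side of the assessment (participants’ lived experiences with the technology) cannot be reduced to code.

```python
# Minimal sketch: disaggregate a face analysis system's accuracy by
# self-identified gender instead of a cis-centric binary.
# Hypothetical records for illustration only.
from collections import defaultdict

# Each record: (self-identified gender, whether the system's output was correct)
records = [
    ("cis man", True), ("cis man", True), ("cis man", False),
    ("cis woman", True), ("cis woman", True), ("cis woman", True),
    ("trans woman", False), ("trans woman", True), ("trans woman", False),
    ("trans man", False), ("trans man", True),
    ("nonbinary", False), ("nonbinary", False),
]

totals, correct = defaultdict(int), defaultdict(int)
for gender, was_correct in records:
    totals[gender] += 1
    correct[gender] += was_correct

for gender in totals:
    rate = correct[gender] / totals[gender]
    print(f"{gender:>11}: accuracy {rate:.0%} (n={totals[gender]})")
```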

Input from affected communities can also be incorporated to identify what aspects of an AI system are most relevant to consider in a technical evaluation. For example, through a participatory workshop, one study identified harms that AI systems pose to queer people, such as data misrepresentation and exclusionary data collection, which can inform technical assessments that delve deeper into these harms and consider the lived experiences of queer people. Organizations advocating on behalf of communities –– such as Queer in AI and the National Association for the Advancement of Colored People (NAACP) –– can offer valuable input on which impacts to evaluate, without overburdening individual community members. (At the same time, neither organizations nor individual members fully represent the entire community, and affected communities should not be treated as monoliths. And it is critical to remember that affected communities include not only those impacted by AI’s outputs, but also those involved in its inputs and model development, such as data workers who produce and label data.)

In the case of facial recognition, traditional assessments use metrics like false positive rates to measure the technology’s performance. However, civil rights organizations such as Big Brother Watch offer community input, arguing that these metrics can be misleading and suggesting that practitioners look to more nuanced metrics, like precision rates across demographic groups, to better understand how the technology impacts different communities. (False positive rates measure the number of errors relative to the total number of people scanned, which can result in a misleadingly low error rate when facial recognition is used to scan large crowds. In contrast, precision rates assess errors against the number of facial recognition matches, providing a clearer picture of the technology’s accuracy.)
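A small worked example makes the gap between the two metrics concrete. The numbers below are hypothetical, not drawn from any real deployment, but they show how a watchlist scan of a large crowd can report a false positive rate well under one percent while most of the matches it generates are still wrong.

```python
# Illustrative only: hypothetical numbers showing why a low false positive
# rate can coexist with low precision when scanning large crowds.
people_scanned = 50_000   # total faces scanned at an event
true_matches = 8          # people in the crowd who are actually on the watchlist
correct_alerts = 6        # watchlist members correctly flagged
false_alerts = 44         # innocent people incorrectly flagged

non_watchlist = people_scanned - true_matches

# False positive rate: errors relative to everyone scanned who is not on the
# watchlist. It is tiny because the denominator is the whole crowd.
false_positive_rate = false_alerts / non_watchlist

# Precision: correct alerts relative to all alerts generated. This is what
# a person acting on a match actually experiences.
precision = correct_alerts / (correct_alerts + false_alerts)

print(f"False positive rate: {false_positive_rate:.3%}")  # roughly 0.09%
print(f"Precision of matches: {precision:.1%}")           # 12.0%
```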

Evaluating facial recognition models could also involve input from individuals whose data was used in training. A qualitative assessment might focus on whether and how they were able to provide informed consent, while a quantitative assessment might estimate the percentage of facial images in a dataset collected without consent. Such assessments are important, especially as companies seek to diversify their datasets, which has led to ethically questionable practices, such as Google reportedly collecting images of unhoused people without their informed consent to improve the Pixel phone’s face unlock system for darker-skinned users.

Of course, these examples illuminate the limits of a technical assessment, as they do not capture the many significant harms of facial recognition systems and related technologies, including their role in overpolicing and oversurveilling Black and brown communities. So while social context can be more deeply incorporated in technical assessments, this does not negate the need to consider the broader impact of AI on people and social structures.

Methods for considering social dimensions in technical assessments

Literature reviews can be used to incorporate context about social structures of harm into a technical assessment. For example, this quantitative evaluation of racial classification in multimodal models was grounded in a qualitative and historical analysis of Black studies and critical data studies literature on the dehumanization and criminalization of Black bodies. Consistent with this literature, the evaluation found that larger models increasingly predicted Black and Latino men as criminals as the pre-training datasets grew in size. Another example is this evaluation of the ImageNet dataset, informed by a literature review of the critiques of the dataset creation process. The evaluation examined issues of privacy, consent, and harmful stereotypes and uncovered the inclusion of pornographic and non-consensual images in ImageNet. (Literature reviews can also be helpful when evaluating a technical system with respect to large-scale societal impacts. For example, to evaluate the environmental costs of AI systems, this article reviews existing tools for measuring the carbon footprint when training deep learning models.)

Technical assessments can be co-designed with impacted and marginalized communities using processes like Participatory Design, Design from the Margin, and Value Sensitive Design. For example, one study conducted community-based design workshops with older Black Americans to explore how they conceptualize fairness and equity in voice technologies. Participants identified cultural representation –– such as the technology having knowledge about Juneteenth or Black haircare –– as a core component of fairness, while also expressing concerns about disclosing identity for representation. This work could inform a co-designed assessment of how voice technologies represent the diversity of Black culture and how much they learn about users’ identities. Another study used participatory design workshops to broadly examine the perceptions of algorithmic fairness among traditionally marginalized communities in the United States, which could serve as a foundation for co-designing evaluation metrics.

Social science research methods like surveys, interviews, ethnography, focus groups, and storytelling can be used to center the lived experiences of impacted communities when evaluating technical components like model inputs and outputs. Research has shown that surveys on AI topics often decontextualize participant responses, exclude or misrepresent marginalized perspectives, and perpetuate power imbalances between researchers and participants. To move towards more just research practices, surveys should be co-created with impacted communities, and qualitative methods with carefully chosen groups of participants should be adopted. For example, one study used focus groups with participants from three South Asian countries to co-design culturally-specific text prompts for text-to-image models and understand their experiences with the generated outputs. The study found that these models often reproduced a problematic outsider’s gaze of South Asia as exotic, impoverished, and homogeneous. Another study involved professional comedians in focus groups to evaluate the outputs of language models for comedy writing, focusing on issues of bias, stereotypes, and cultural appropriation. Additionally, at a FAccT conference tutorial, Glenn Rodriguez, who was formerly incarcerated, used storytelling to illuminate how an input question to the COMPAS recidivism tool –– which asks an evaluator if the person appears to have “notable disciplinary issues” –– could make the difference between parole being granted and parole being denied.

When gathering community input through the co-design and social science methods discussed above, it is important to conduct a literature review beforehand to understand the histories and structures of harm experienced by affected communities. This desk research helps reduce misunderstandings and enables informed community engagement.

2. Engage with users, impacted communities, and entities with power to evaluate human-AI interactions

To evaluate the interactions between people and an AI system, it is important to engage with the users of the system, communities affected by the system, and entities that hold significant influence over the design and deployment of the system. 

First, researchers and practitioners can examine how users interact with the AI system in practice and how the system shapes their behavior or decisions. In the case of police use of facial recognition technology, a qualitative assessment could investigate whether and how officers modify the images submitted to the technology. In contrast, a quantitative assessment might measure the accuracy of officer verifications of the technology’s output when they serve as the “human in the loop,” given the risk that they may incorrectly view the technology as objective and defer to its decisions.
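In simplified form, the quantitative piece of that assessment could measure how often officers confirm matches that turn out to be incorrect, as in the sketch below; the review records are invented purely to illustrate the calculation.

```python
# Minimal sketch: estimate how often human reviewers defer to an incorrect
# system match. Hypothetical review records for illustration only.

# Each record: (system match was correct, officer confirmed the match)
reviews = [
    (True, True), (True, True), (True, False), (True, True),
    (False, True), (False, True), (False, False), (False, True),
]

confirmed_errors = [confirmed for was_correct, confirmed in reviews if not was_correct]
deference_rate = sum(confirmed_errors) / len(confirmed_errors)

print(f"Share of incorrect matches that officers confirmed: {deference_rate:.0%}")
```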

However, it is important to recognize that the users of an AI system are not always the communities impacted by the system. For instance, police use of facial recognition in the U.S. often disproportionately harms Black communities, who have been historically oversurveilled. To understand this broader impact of the technology on people, one study used a mixed methods approach to examine how impacted communities in Detroit perceived police surveillance technologies. Another assessment might examine the technology’s impact on encounters that Black activists have with police, as seen in the case of Derrick Ingram, who was harassed by officers after being targeted with the technology at a Black Lives Matter protest.

Moreover, just as researchers and practitioners can uncover how communities are impacted by an AI system, they can also “reverse the gaze” by examining the entities that hold power over the system. In the case of facial recognition, one might examine where police deploy the technology and their decision-making processes that shape deployments. For instance, Amnesty International’s Decode Surveillance initiative mapped the locations of CCTV cameras across New York City that can be used by the police. Their quantitative and qualitative analysis revealed that areas with higher proportions of non-white residents had a higher concentration of cameras compatible with facial recognition technology.
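The quantitative component of that kind of analysis might, under strong simplifying assumptions, resemble the sketch below, which correlates neighborhood-level camera density with the share of non-white residents. The data is synthetic and merely stands in for a mapped dataset like Decode Surveillance’s; it is not Amnesty International’s actual data.

```python
# Minimal sketch: correlate per-neighborhood camera density with demographic
# composition. Synthetic data standing in for mapped CCTV locations.
import numpy as np

rng = np.random.default_rng(1)
n_neighborhoods = 200
share_nonwhite = rng.uniform(0.1, 0.9, size=n_neighborhoods)

# Hypothetical generative assumption for illustration: camera counts rise
# with the non-white share of residents, plus Poisson noise.
cameras_per_km2 = rng.poisson(lam=5 + 20 * share_nonwhite)

corr = np.corrcoef(share_nonwhite, cameras_per_km2)[0, 1]
print(f"Correlation between non-white resident share and camera density: {corr:.2f}")
```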

Methods for holistic assessments of human-AI interactions

Human-computer interaction (HCI) methods like surveys, workshops, interviews, ethnography, focus groups, diary studies, user research, usability testing, participatory design, and behavioral experiments can be used to engage with users of an AI system, impacted communities, and the entities shaping the system. For example, one study conducted an ethnography to examine how users –– specifically, judges, prosecutors, and pretrial and probation officers –– employed risk scores from predictive algorithms to make decisions. Another study assessed child welfare service algorithms using interactive workshops with front-line service providers and families affected by these algorithms. Still another study conducted interviews with financially-stressed users of instant loan platforms in India to investigate power dynamics between users and platforms, work that may have influenced Google to improve data privacy for personal loan apps on its Play Store.

Investigative journalism methods like interviews, content and document analysis, and behind-the-scenes conversations with powerful stakeholders are valuable for examining how entities influence or deploy an AI system. When an AI system operates as an opaque black box, legal channels like personal data requests under the California Consumer Privacy Act or public records requests under the Freedom of Information Act can enable access to relevant information about how powerful stakeholders shape and use the system. For example, a public-service radio organization in Germany examined whether a food delivery company improperly monitored its riders: the organization invited riders to request the data the company tracks about them under the European General Data Protection Regulation and then share that data for analysis. Researchers at the Minderoo Center for Technology and Democracy used freedom of information requests and document analysis to examine how UK police design and deploy facial recognition technology. While most commonly used by third-party researchers, these methods are not limited to external actors; internal practitioners working on AI ethics and governance can use similar methods to assess how research and product teams design AI systems before they launch.

3. Evaluate AI’s impact on social systems and people’s rights with specific objectives to enable accountability

Assessing the impact of an AI system requires considering not only how different groups of people interact with it but also its role within broader social and legal contexts. Important values such as privacy and equity are embedded in legal systems, and evaluating a technology’s impact on people’s rights can support advocacy and policy efforts. In the case of facial recognition, one might qualitatively examine the technology’s impact on the rights to free expression, data protection, and non-discrimination, such as protections codified in the First Amendment, the California Consumer Privacy Act, and the Civil Rights Act in the U.S.

It is also important to consider the impact of AI on social systems, like mass media, the environment, labor markets, political parties, educational institutions, and the criminal legal system, as well as effects on social dynamics like public trust, cultural norms, and human creativity. In the context of facial recognition, for example, a broader assessment might examine how community safety and public trust in institutions are impacted when this technology is adopted more widely, not only by the criminal legal system but also by schools, airports, and businesses.

For such assessments of broader impacts to be meaningful and to support holding AI actors accountable, they should be designed with specific objectives and outcomes in mind. For example, an assessment of facial recognition might focus on its impact on the right to free expression and target entities that shape governance around the technology, such as the U.S. Government Accountability Office. Moreover, the assessment should aim for a concrete outcome, like determining whether the use of facial recognition meets specific legal standards, rather than producing a broad, open-ended list of legal concerns. Although specificity is often associated with technical evaluations, research has found that when evaluations of broader impacts are made specific, they can prompt stakeholders to take action, help advocates cite concrete evidence, and enable more precise and actionable policy demands.

When an assessment is made specific, it is important to prioritize the most relevant systemic impacts. For example, one might focus on facial recognition’s effect on free expression since surveillance can significantly inhibit political dissent, which is vital for social justice movements. To operationalize this investigation, one could evaluate how the presence of the technology at protests affects activists’ participation or how the application of the technology online affects the use of social media for activism.

Methods for considering broader societal impacts in assessments

Social science research methods like surveys, forecasts, interviews, experiments, and simulations can be used to evaluate the impact of AI on social systems and dynamics. For example, one study analyzed the chilling effect of peer-to-peer surveillance on Facebook through an experiment and interviews. Another assessment used simulation to examine how predictive algorithms in the distribution of social goods affect long-term unemployment in Switzerland. To understand the environmental impact of AI systems, one study estimated the carbon footprint of BLOOM, a 176-billion parameter language model, across its lifecycle, while another argued that assessments should focus on a specific physical geography to highlight impacts on local communities and shape local actions that can advance global sustainability and environmental justice.
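For the training-phase portion of a carbon estimate like the one conducted for BLOOM, a back-of-the-envelope calculation multiplies hardware energy use by data-center overhead and grid carbon intensity; the figures below are placeholders for illustration, not the study’s actual accounting.

```python
# Rough sketch of a training-phase carbon estimate:
#   energy    = GPU-hours x average power per GPU x data-center overhead (PUE)
#   emissions = energy x grid carbon intensity
# All figures are hypothetical placeholders.
gpu_hours = 1_000_000        # total GPU-hours used for training (placeholder)
avg_gpu_power_kw = 0.4       # average draw per GPU, in kW (placeholder)
pue = 1.2                    # power usage effectiveness of the data center
carbon_intensity = 0.06      # kg CO2-eq per kWh (low-carbon grid assumption)

energy_kwh = gpu_hours * avg_gpu_power_kw * pue
emissions_tonnes = energy_kwh * carbon_intensity / 1000

print(f"Estimated energy use: {energy_kwh:,.0f} kWh")
print(f"Estimated emissions:  {emissions_tonnes:,.1f} tonnes CO2-eq")
```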

Legal analysis is a useful method for assessing the legal compliance of an AI system’s design and usage. This method involves examining how the AI system may infringe upon rights by reviewing relevant case law, legislation, and regulations. For example, one audit evaluated Facebook’s ad delivery algorithm for compliance with Brazilian election laws around political advertising. Another study examined the London Metropolitan Police Service’s use of facial recognition with respect to the Human Rights Act 1998, finding that the usage would likely be deemed unlawful if challenged in court.

Power mapping can be used to identify target entities and design assessments that foster accountability. This method can help identify what will motivate influential individuals and institutions to take action. For example, the Algorithmic Ecology tool mapped the ecosystem surrounding the predictive policing technology PredPol, outlining PredPol’s impact on communities and identifying key actors across sectors who have shaped the technology. The tool has been crucial for understanding the extent of PredPol’s harms, challenging its use, and offering a framework that can be applied to other technologies.

Not All Assessments Are Created Equal

We have discussed a range of approaches for assessing the impacts of a given AI system –– at the technical system layer, the human interaction layer, and the systemic impact layer. However, efforts across these layers will not necessarily carry equal weight in every context, and researchers and practitioners should prioritize certain layers based on the specific AI system being assessed. The greater the system’s potential to affect people’s rights, the more critical it is to consider its impact on users, communities, and society at large.

For example, an assessment of police use of facial recognition should center the technology’s significant role in oversurveilling and overpolicing communities of color, rather than focusing narrowly on its performance for those communities, which can result in technical improvements that perfect it as a tool of surveillance. In contrast, an assessment of a voice assistant like Siri, which may pose a lower immediate risk, could initially focus on the technical system layer. Yet the social dimensions remain crucial to consider even at this layer. For instance, informed by an understanding of the dominance and enforcement of standardized American English, practitioners might explore how the voice assistant performs on African American Vernacular English and whether it excludes or misunderstands Black American speakers.

By prioritizing certain kinds of assessments, we can not only gain a deeper understanding of the impacts of AI technology, but also shape decisions around its design and deployment, and identify red lines where we may not want the technology to be developed or deployed in the first place. Additionally, by assessing AI systems that have real-world influence, we can draw attention to their actual, everyday impacts rather than hypothetical concerns.

Our recommendations consider AI technology not merely as a technical tool, but as a system that both shapes and is shaped by people and social structures. Understanding these broader impacts requires a diverse set of methods that are appropriate for the specific AI system being assessed. Thus, we encourage researchers and practitioners to adopt more holistic methods and urge policymakers to support and incentivize these approaches in AI governance. Moreover, we hope this work fosters the development of assessments that scrutinize systems of power and ultimately uplift the communities most impacted by AI.

Acknowledgements

Thank you to Miranda Bogen and Ozioma Collins Oguine for valuable feedback on this blog post. We also acknowledge the Partnership on AI’s Global Task Force for Inclusive AI Guidelines for insights on participatory approaches to understanding the impacts of AI systems.