
Intersectional Disparities within Automated Hate-speech Detection Across US-Centered Social Media Content

This blog was authored by Amanda Zaner, Summer Extern for CDT Research. 

Navigating the socio-linguistic nuances of communication across social media platforms is no simple task. When it comes to online content moderation, platforms frequently rely on automated detection systems to identify and address abusive language posted in these digital environments. Recent research has increasingly focused on evaluating how well machine-learning language algorithms address hate speech. This entails examining the wider social implications of moderation practices on social media content, and exploring community-centered research efforts to reduce bias in automated algorithms that disproportionately affect marginalized groups.

Despite extensive research on English-language posts across social media platforms, hate-speech detection systems still struggle with implicit, non-pejorative, and reclaimed English language. This complicates accurate identification of abusive content aimed at individuals based on identity characteristics such as gender, race, ethnicity, or disability, especially when multiple aspects of a person’s identity are targeted simultaneously (e.g., women of color and LGBTQ+ community members with disabilities). To mitigate negative impacts of current automated content analysis and moderation practices, developers should incorporate more diverse, representative perspectives into training data and research methods used for developing hate-speech detection algorithms. 

Within the context of US-centered social media analysis, effective hate-speech detection algorithms are particularly important because of the rise in potentially abusive content associated with US national elections. A recent CDT report found that women of color political candidates faced a disproportionate amount of hate speech online in the summer leading up to the 2024 elections. That study used hate-speech detection algorithms to count the number of posts directed at candidates that contained hate speech. If anything, the report likely understated the problem, since flaws in these algorithms cause them to under-identify hate speech. An earlier CDT report likewise found that when political candidates are targeted, women of color experience more violent online harassment, reflecting the intersection of racist and misogynist sentiment. Hateful or abusive posts, especially when compounded by misinformation, contribute to a chilling effect that prompts many targeted individuals to self-censor and reduce their online presence.

Platform policies on hate speech are difficult to demystify partly because of a lack of transparency around content moderation practices, including when and how platforms use AI systems. Major platforms like Instagram, Facebook, and X (formerly Twitter) use machine-learning algorithms to identify problematic content. Natural Language Processing (NLP), a field of AI closely intertwined with machine learning, is crucial for analyzing and generating text, with Large Language Models (LLMs) being particularly effective for reviewing social media content due to their ability to process large datasets and adapt to new instructions. This is a powerful feature for content moderators who monitor highly polarized and fast-paced digital environments on a daily basis.
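To make the “adapt to new instructions” point concrete, the minimal sketch below shows what instruction-driven review can look like with the OpenAI Python client. The model name, prompt wording, and label set are illustrative assumptions, not a description of any platform’s actual moderation pipeline.

```python
# Minimal sketch of instruction-driven content review with an LLM.
# Model name, prompt wording, and labels are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MODERATION_INSTRUCTIONS = (
    "You review social media posts for a US-based platform. "
    "Label each post as HATE_SPEECH, ABUSIVE, or NONE, and name the "
    "targeted identity group if one is present. Reply in JSON."
)

def review_post(post_text: str) -> str:
    """Ask the model to label a single post under the instructions above."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": MODERATION_INSTRUCTIONS},
            {"role": "user", "content": post_text},
        ],
        temperature=0,  # keep labeling as consistent as possible
    )
    return response.choices[0].message.content

print(review_post("Example post text goes here."))
```

Because the instructions live in the prompt rather than in retrained model weights, moderators can revise the label set or add clarifying guidance without rebuilding the underlying model, which is part of what makes LLMs attractive in fast-moving environments.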

AI tools have transformed online content analysis and moderation, but biases in model training can lead systems to incorrectly flag posts, misidentify hate speech targets, and provide misleading context explanations (known as “hallucinations”). Choices made by researchers or moderators regarding data, annotation categories, and the quantity and types of examples used to prompt algorithms affect model performance and introduce algorithmic biases. Research shows that hate-speech detection often misidentifies or stereotypes individuals using dialects or group-specific language, like African American English (AAE) or LGBTQ+ vernacular, especially in contexts of language reclamation, self-identification, or educational content.
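One way researchers surface these disparities is to compare a detector’s false-positive rate across groups of benign posts, for example posts that use reclaimed terms versus posts that do not. The toy keyword detector and the tiny evaluation set below are illustrative only, but they show both the measurement and why context-blind systems misfire on reclaimed language.

```python
# Sketch of a per-group false-positive audit for a hate-speech detector.
# The toy keyword detector and example posts are illustrative only.
from collections import defaultdict

FLAG_TERMS = {"queer"}  # a crude deny-list; real detectors are far more complex

def toy_classify(text: str) -> bool:
    """Flag a post if it contains any deny-listed term, regardless of context."""
    return any(term in text.lower() for term in FLAG_TERMS)

# Hypothetical benign posts annotated with the kind of language they use.
records = [
    {"text": "I'm proud to be queer and out.", "group": "reclaimed", "is_hate": False},
    {"text": "Happy to be out and proud.", "group": "standard", "is_hate": False},
]

def false_positive_rates(records, classify):
    """Share of benign posts flagged, broken out by language group."""
    flagged, benign = defaultdict(int), defaultdict(int)
    for r in records:
        if r["is_hate"]:
            continue  # only benign posts can produce false positives
        benign[r["group"]] += 1
        if classify(r["text"]):
            flagged[r["group"]] += 1
    return {group: flagged[group] / benign[group] for group in benign}

print(false_positive_rates(records, toy_classify))
# The self-identifying post is flagged even though neither post is hateful.
```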

Research also shows that automated content moderation on social media platforms often disadvantages marginalized groups. Despite Meta’s uniform hate speech policies, employees report that in practice Instagram’s automated algorithms detect more hate speech targeting white people than they detect targeting Black people. Black users also face a 50% higher likelihood of having their accounts automatically disabled. A study on intersectional hate speech found that only 17% of posts on X containing “misogynoir” (anti-Black forms of misogyny) were classified properly by popular hate-speech detection algorithms. Another study concluded that transgender Facebook users face higher rates of content removal when posting about their identities, such as by “coming out” or using terms like “queer,” which is commonly reclaimed within the LGBTQ+ community. Studies of hate-speech detection model performance have found that posts containing reclaimed terms are more likely to be flagged as harmful when used in counter-speech, educational explanations, or discussions of self-identity by LGBTQ+ users. Independent research remains vital for understanding biases in automated hate-speech detection models. Yet support for external social media monitoring tools is limited due to new data-access restrictions from major platforms in the US.

Expanding transparent, accessible, and community-driven research is essential for improving AI tools like automated content moderation and reducing bias in hate-speech detection. For example, group-specific approaches to LLM research center marginalized communities and seek to understand how models respond to community-specific language use. Transparent and accessible research is vital for improving representation, and for building confidence in the merits of these approaches. Such research has shown, for example, that in automated hate-speech detection, LLM algorithms can quickly adapt to new language and concepts through additional prompting.
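One concrete form of “additional prompting” is few-shot prompting with community-annotated examples of reclaimed usage. The sketch below, which reuses the illustrative OpenAI client from the earlier example, shows how such examples could be prepended to a request; the examples, labels, and model name are assumptions for illustration, not a documented moderation practice.

```python
# Sketch of few-shot prompting with community-annotated examples of
# reclaimed language. Examples, labels, and model name are assumptions.
from openai import OpenAI

client = OpenAI()

# Hypothetical examples annotated by community members, not a real dataset.
FEW_SHOT_EXAMPLES = [
    ("I'm a proud queer artist showing new work this weekend.", "NONE (self-identification)"),
    ("Queer history deserves a place in the school curriculum.", "NONE (educational)"),
]

def build_messages(post_text: str) -> list:
    """Prepend annotated examples so the model sees reclaimed usage in context."""
    messages = [{
        "role": "system",
        "content": "Label each post as HATE_SPEECH or NONE. Reclaimed, "
                   "self-identifying, or educational uses of identity terms are NONE.",
    }]
    for example, label in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": example})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": post_text})
    return messages

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=build_messages("So glad my queer book club is meeting again."),
    temperature=0,
)
print(response.choices[0].message.content)
```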

Improving access to datasets is also a crucial step. A study reviewing hate-speech detection research found that, across published articles, 51% of the datasets used are publicly available, and fewer than 35% of datasets describe their methodology. Publicly available tools and datasets for hate-speech research (e.g., Detoxify, QueerReclaimLex, HateSonar, and HateXplain, to name a few) help to crowdsource research, offering opportunities to center expertise and collaboration across disciplines. I would also argue that integrating perspectives from content moderators is vital. Online community members can contribute to the review of content in accordance with platform rules (called “reactive moderation”).
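As a small example of what working with one of these publicly available tools looks like, the snippet below scores a pair of posts with the open-source Detoxify package; the example posts are illustrative, and the exact score labels returned depend on the model variant.

```python
# Sketch of scoring posts with the open-source Detoxify package
# (pip install detoxify). The example posts are illustrative only.
from detoxify import Detoxify

model = Detoxify("original")  # downloads a pretrained toxicity classifier

posts = [
    "I'm proud to be a queer organizer in my community.",
    "Great turnout at the candidate forum last night.",
]

# predict() accepts a string or a list and returns per-label scores;
# the exact label names depend on the model variant.
scores = model.predict(posts)
for label, values in scores.items():
    print(label, [round(v, 3) for v in values])
```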
Expanding diverse, community-driven research, improving method transparency, and increasing dataset access are crucial for developing fairer AI tools and mitigating adverse effects of current social media content moderation practices. Context analysis, fueled by interdisciplinary collaboration, enhances data collection quality, leading to more representative training of hate-speech detection models and stronger ethical considerations for people-centered research.