

Understanding Automation and the Coronavirus Infodemic: What Data Is Missing?

As government leaders, policymakers, and technology companies continue to navigate the global coronavirus pandemic, CDT is actively monitoring the latest responses and working to ensure they are grounded in civil rights and liberties. Our policy teams aim to help leaders craft solutions that balance the unique needs of the moment, while still respecting and upholding individual human rights. Find more of our work at cdt.org/coronavirus.

One unanticipated consequence of the global COVID-19 pandemic has been a shift in online content moderation on some of the biggest social media platforms. Facebook and YouTube have each announced significant changes to their approaches to identifying and removing potentially rule-violating content, taking human moderators out of the loop and relying more heavily on automation and machine-learning classifiers to triage millions of posts, images, and videos. (Twitter has also announced that it will rely on automation to deal with COVID-19-specific disinformation, but has not said that staff involvement in content moderation will decrease.)

There’s no question that these companies made the right call in sending their moderator staff home last month, to practice safe social distancing and to abide by lockdown orders during this pandemic. There are, however, many questions about the consequences this shift to automation is having for people’s access to information and ability to report on developments during this global public health crisis—and it’s not clear how those questions can be answered.

That’s why today, CDT joined 75 organizations and researchers in publishing an open letter to social media companies and other content hosts, urging them to enable future research and analysis about the “infodemic” side of COVID-19 by preserving information about what their systems are automatically blocking and taking down. Without understanding what kind of content is staying up, coming down, or never making it online in the first place, it will be hard to assess the efficacy of efforts to share vital public health information while combating the spread of coronavirus scams and pandemic profiteering.

One particular area of concern is the takedown of content that evaluates or reports on governments’ responses to COVID-19. The Russian government is reportedly asking social media companies to censor media outlets that report what the authorities deem to be “false information that is socially significant” about the coronavirus. As governments exercise emergency powers to control people’s movements, there are rising reports of police brutality and abuse of power, including in Paraguay, the Philippines, India, Nigeria, and Kenya. Social media will continue to be a vital tool for people to report on and document human rights abuses throughout the pandemic, and it is crucial that this speech not be swept up in companies’ automated moderation systems or censored at the request of governments.

The COVID-19 crisis is unfolding in the midst of a years-long policy debate over the role of automation in content moderation, with policymakers around the world pushing for mandatory content filtering or “proactive measures” to block illegal material. As CDT and many other advocates and experts have explained, automatic filters are imprecise and prone to both over- and under-blocking in ways that can disproportionately impact already marginalized speakers and groups. Sophisticated machine learning techniques for automating content analysis do not magically solve these problems, and even the most advanced filtering systems still essentially function as a prior restraint on speech.

While we humans are prone to our own biases and errors, human involvement in moderation is an essential safeguard to mitigate the worst effects of filtering. Humans are able to bring cultural, linguistic, and historical context to their analysis of other people’s speech in a way that machines cannot replicate. This allows social media services, in normal operating circumstances, to use a mix of automation to flag concerning content and humans to actually evaluate it against the company’s policies. Human review is also a key feature of companies’ appeals processes, which are an important procedural safeguard against the erroneous decisions that both humans and machines can make. Facebook, YouTube, and Twitter have all been clear with their users that they cannot provide appeals while their moderator staff is so reduced: a necessary piece of crisis-response transparency, but not a practice that should endure long-term.
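To make that division of labor concrete, here is a minimal sketch of a flag-then-review triage pipeline. It is purely illustrative: the classifier, thresholds, and routing labels are assumptions made for the example, not any platform’s actual system. The point is simply that when the human queue disappears, the machine’s uncertain calls become final.

```python
from dataclasses import dataclass


@dataclass
class Post:
    post_id: str
    text: str


def classifier_score(post: Post) -> float:
    """Stand-in for a real machine-learning classifier, returning a score in [0, 1].

    Real platforms use proprietary models; this keyword check is only a placeholder.
    """
    return 0.9 if "miracle cure" in post.text.lower() else 0.1


REMOVE_THRESHOLD = 0.95  # assumed: only near-certain violations are removed automatically
REVIEW_THRESHOLD = 0.60  # assumed: uncertain cases normally go to a human reviewer


def triage(post: Post, human_review_available: bool) -> str:
    """Route a post using an automated score, falling back to machine-only
    decisions when human reviewers are unavailable."""
    score = classifier_score(post)
    if score >= REMOVE_THRESHOLD:
        return "auto_remove"  # high-confidence violation
    if score >= REVIEW_THRESHOLD:
        if human_review_available:
            return "queue_for_human"  # a person applies policy, context, and language knowledge
        return "auto_remove"  # crisis fallback: more over- and under-blocking
    return "leave_up"


# With reviewers at home, a borderline post is removed rather than reviewed.
print(triage(Post("1", "Try this miracle cure!"), human_review_available=False))
```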

As CDT and many human rights advocates have noted, states’ emergency powers “must be time-bound, and only continue for as long as necessary to address the current pandemic.” It’s important to recognize that the automation-reliant version of content moderation currently in use on these services is, itself, an “emergency power” for the social media companies. It’s a stopgap that is necessary during a particular set of emergency circumstances: the inability to have human moderators do their jobs in a way that protects both their health and the privacy of the services’ users. Automated content moderation cannot be the new status quo.

These questions around automated content moderation and the COVID-19 crisis highlight how difficult it is to conduct solid empirical research on our online information environment. The data necessary for this research is held by multiple private companies, and some important information, such as the amount and type of content blocked at upload, may not be recorded at all. But there are genuine and significant privacy concerns with companies retaining this data, whether it’s made available to third-party researchers or not. When companies retain data, they increase the risk that it gets exposed through a data breach or is demanded by government officials—and we know that governments are eager to get their hands on data related to COVID-19. Even if companies retain only anonymized content, and not the associated user’s personal information, that content may have been intended for a non-public audience and may contain identifiable information. Companies have an obligation to respect user privacy whenever they retain user information, and need to incorporate appropriate safeguards such as de-identification, data aggregation, and limitations on retention, access, and purpose.
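As one way to picture what such safeguards could look like in practice, the sketch below assumes a hypothetical moderation log and shows pseudonymization plus aggregation before any research access. The field names, categories, and salted-hash approach are assumptions for illustration, not any company’s practice, and a salted hash on its own would not meet a rigorous de-identification standard.

```python
import hashlib
from collections import Counter
from datetime import date

# Hypothetical moderation log entry; every field name here is an assumption for illustration.
example_record = {
    "user_id": "user-123",
    "action": "upload_block",  # e.g. auto_remove, upload_block, label
    "policy": "covid_misinformation",
    "timestamp": date(2020, 4, 20),
}


def de_identify(record: dict, salt: str) -> dict:
    """Replace the user ID with a salted hash (simple pseudonymization, not full
    anonymization) and keep only the fields needed to study moderation outcomes."""
    token = hashlib.sha256((salt + record["user_id"]).encode()).hexdigest()
    return {
        "user_token": token,
        "action": record["action"],
        "policy": record["policy"],
        "day": record["timestamp"].isoformat(),  # coarsen to the day; drop exact times
    }


def aggregate_by_policy(records: list) -> Counter:
    """Count actions per policy so researchers can study volumes and error-prone
    categories without access to individual posts or users."""
    return Counter((r["policy"], r["action"]) for r in records)


print(aggregate_by_policy([de_identify(example_record, salt="rotate-this-regularly")]))
```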

But no one can do research on data that doesn’t exist. We will need reliable information about what is actually happening in our online information environment during this crisis, as the tech companies’ responses to COVID-19 are sure to be a main focus of tech policy debates for years to come. Companies need to anticipate the questions that will be asked, and listen to health experts, researchers, and journalists, to understand what additional information they may need to preserve right now. If the privacy and security costs of preserving certain kinds of information are too great, companies also need to explain that, and to clarify what it means for the kinds of claims that can and cannot be made about the efficacy and error rates of their moderation systems. If we’ve learned anything from the pandemic, it’s that for any statistic we see, we need to understand how it is shaped by data that isn’t there.