CDT Research, Free Expression, Privacy & Data
Learning to Share: Lessons on Data-Sharing from Beyond Social Media
What role has social media played in society? Did it influence the rise of Trumpism in the U.S. and the passage of Brexit in the UK? What about the way authoritarians exercise power in India or China? Has social media undermined teenage mental health? What about its role in building social and community capital, promoting economic development, and so on?
To answer these and other important policy-related questions, researchers such as academics, journalists, and others need access to data from social media companies. However, this data is generally not available to researchers outside of social media companies and, where it is available, it is often insufficient, meaning that we are left with incomplete answers.
Governments on both sides of the Atlantic have passed or proposed legislation to address the problem by requiring social media companies to provide certain data to vetted researchers (Vogus, 2022a). Researchers themselves have thought a lot about the problem, including the specific types of data that can further public interest research, how researchers should be vetted, and the mechanisms companies can use to provide data (Vogus, 2022b).
For their part, social media companies have sanctioned some methods of sharing data with certain types of researchers through APIs (e.g., for researchers with university affiliations) and with certain limitations (such as limits on how much and what types of data are available). In general, these efforts have been insufficient. In part, this is due to legitimate concerns such as the need to protect user privacy or to avoid revealing company trade secrets. But, in some cases, the lack of sharing is due to other factors, such as a lack of resources or knowledge about how to share data effectively, or resistance to independent scrutiny.
The problem is complex but not intractable. In this report, we look to other industries where companies share data with researchers through different mechanisms while also addressing concerns around privacy. In doing so, our analysis contributes to current public and corporate discussions about how to safely and effectively share social media data with researchers. We review experiences based on the governance of clinical trials, electricity smart meters, and environmental impact data.
Clinical Trials 
In most cases, the FDA requires companies and research centers to share data about the clinical trials they use to verify the safety and efficacy of a medical product as a condition of bringing that product to market. Group-level summary data and metadata on the studies’ methodologies are made publicly available on ClinicalTrials.gov. Voluntary mechanisms such as the Yale Open Data Access Project (YODA) enable those running clinical trials to securely share additional anonymized data with independently vetted researchers for independently approved projects. In general, researchers use clinical trial data to monitor the safety and efficacy of certain drugs, assess the validity of trial methodologies, and, at a more meta level, assess the extent to which companies are complying with their data publishing requirements.
Electricity Smart Meters
Smart meters record electricity usage and report this information back to the utility for billing purposes using radio frequency networks. Researchers also use smart meter data to help evaluate and improve the energy efficiency of buildings (Adams et al., 2021), inform energy demand response strategies (National Council on Electricity Policy, 2008), and improve battery management (Zheng et al., 2019). Some smart meter data sets collated by academics and governments exist, but it is more difficult for researchers to request data from utilities, especially for a specific geographical area. Some states have pathways for researchers to get electricity consumption data directly from utilities, though that data must meet certain aggregation and anonymization thresholds. These thresholds are often easy to understand and implement, but their simplicity means they can be too conservative in some cases, needlessly preventing harmless research, and too liberal in others, failing to protect some individuals.
Environmental Impact Statements
Government agencies are required to assess the environmental impact of any project that uses federal land or federal tax dollars, requires federal authorization, or is under the jurisdiction of a federal agency (Middleton, 2021). These assessments come in the form of an environmental impact statement (EIS), which the public can then comment on (U.S. EPA, n.d.-b). Researchers use the EIS process itself as a source of political leverage for citizen science, and also use historical data both to assess methods of mitigating environmental harm (Marcot et al., 2001) and to evaluate and improve the effectiveness of the environmental review process itself (O’Faircheallaigh, 2010). However, EIS data does not come in a standardized form that can be easily used by researchers at scale. In the United States in particular, EISs are allowed to exclude a lot of information under the protections of trade secrecy (Lamdan, 2017). By contrast, in the UK, there is a public interest test: authorities “can refuse to provide information only when the public interest in maintaining the exception outweighs the public interest in disclosure” (Information Commissioner’s Office, 2022).
Lessons for social media companies from other industries
Using these three cases, we outline ten lessons that social media companies, policymakers, and others should consider when developing policies to improve researcher access to data:
- Sharing data with researchers can help make more informed policy decisions. Clinical trial data, smart meter electricity data, and data underlying environmental impact statements are all governed in a way that lets researchers use the knowledge they gain to help inform the policymaking process. When designing mechanisms to give researchers access to social media data, policymakers should consider designing analogous feedback loops.
- Sharing data can let researchers double-check otherwise unverifiable corporate claims. Social media companies often respond to public criticism by making changes to their systems, but there is no way for independent researchers to verify the effectiveness or veracity of these changes. Other sectors show a way forward: clinical trial data is shared in a way that is specifically designed to allow third-party researchers to stress-test and verify whether medical products work. Environmental impact statements further shift the paradigm, allowing the public to identify shortcomings or knock-on effects before an intervention is rolled out.
- The “denominator problem” can be addressed without compromising privacy. When an independent researcher establishes some finding based on the limited data they have available, there is no way for them to precisely determine the overall size of the finding relative to the social media platform in question. For example, if they find that 10% of users in a given sample of data share misinformation, it’s hard to know what that means about the population of all users on the platform. Experience in other industries shows that aggregation and anonymization techniques can allow this kind of population-level information to be shared without compromising individual privacy.
- Addressing the “black box” problem will make research more widely applicable. Researchers struggle to use data sharing tools provided by social media companies because they offer little information on how a given data set was produced. Clinical trial data on YODA, ClinicalTrials.gov, and elsewhere gives researchers the context they need by including metadata about how the data was generated, such as trial protocols and statistical analysis methods.
- Transparency mechanisms let civil society serve as data sharing watchdogs. The lack of data available for researchers, particularly those in civil society, undermines attempts at meaningful transparency and accountability for social media. The EIS review process and FDAAA Trials Tracker, which uses ClinicalTrials.gov data to calculate how many covered trials have reported their results, show how sharing even a little data with researchers can contribute meaningfully to oversight.
- Standards make shared data usable. Standards are an important way for researchers to know what data to expect and how they can expect to receive it. Robust standards set by the FDA and NIH have made clinical trial data more useful for researchers. A lack of those standards has made EIS data less systematized and thus less useful. Today, with each platform having its own protocol for sharing data, social media falls closer to the latter camp.
- Data sharing should be flexible to accommodate public crises. The experiences from the three industries show that normative trade-offs can be made when it comes to public crises and sharing data. For example, ClinicalTrials.gov expedited and broadened its data sharing about COVID-19 vaccines, though many in the medical community called for even greater transparency than it actually provided. Social media companies should support greater access to data when the public interest is particularly strong, such as during natural disasters and elections.
- Ease of understanding is a factor to consider in privacy. Social media companies tend to be opaque about the methods they use to ensure user privacy. Examples from other industries show how privacy rules for preparing and sharing data can be intuitive and easier to understand. For example, the 15/15 rule for smart meter data (where each geographical unit of data requested must include at least 15 commercial customers, and no customer may make up more than 15% of the total power usage) and the 18 direct identifiers in HIPAA, which cannot be shared in clinical trials, are easy for the public to understand and likely easier to enforce, though they also come with sacrifices in effectiveness.
- Data access can be tailored to different use cases. A tiered access approach is sometimes posited for and used by social media companies when it comes to access to data. In other industries, access is also more specifically tailored to the capabilities and goals of different types of researchers, such as the California Public Utilities Commission’s (CPUC) distinction between government and academic researchers. This approach allows for greater flexibility, making researchers more likely to be able to access the most useful type of data for their research.
- Diverse data stewards offer new affordances. In each of the industries we examined, different actors play a role in facilitating the sharing of data, including private, government, academic, and civil society organizations. EIS data, for example, is organized both by government actors, such as the EPA, and academic actors, such as Northwestern. Clinical trial data is also shared by multiple actors, through compulsory and voluntary data sharing mechanisms. This expands the range of options (types of data, requirements, limitations, etc.) available to researchers. Social media does not benefit from this diversity because, as of now, private companies are the sole stewards of data.
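To illustrate how simple a rule like the 15/15 rule described above is to state and check, here is a minimal Python sketch. It is an illustration under stated assumptions, not any utility's or regulator's actual implementation: the function name `passes_15_15`, the constants, the per-customer usage list as input, and the choice to withhold a group whose total usage is zero are all hypothetical.

```python
from typing import Sequence

# Thresholds taken from the 15/15 rule as described in this report.
MIN_CUSTOMERS = 15   # each requested group must contain at least 15 customers
MAX_SHARE = 0.15     # no single customer may exceed 15% of the group's total usage


def passes_15_15(usages_kwh: Sequence[float]) -> bool:
    """Return True if an aggregate over these customers could be released.

    `usages_kwh` holds one usage figure per customer in the requested
    geographical unit (a hypothetical input format for this sketch).
    """
    if len(usages_kwh) < MIN_CUSTOMERS:
        return False
    total = sum(usages_kwh)
    if total <= 0:
        # Assumption: with no usage to aggregate, conservatively withhold.
        return False
    largest_share = max(usages_kwh) / total
    return largest_share <= MAX_SHARE


if __name__ == "__main__":
    even_block = [100.0] * 20                 # 20 similar customers
    small_block = [100.0] * 14                # too few customers
    skewed_block = [100.0] * 19 + [1000.0]    # one dominant customer
    for name, block in [("even", even_block),
                        ("small", small_block),
                        ("skewed", skewed_block)]:
        print(name, passes_15_15(block))
```

A check this short is easy to explain and audit, which is the appeal noted above. The same simplicity is also its limit: a fixed cutoff can block a harmless request that falls just short of the threshold while still releasing a group that technically passes but remains revealing.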
The analysis of clinical trials in this report is based on a forthcoming law review article by Christopher Morten, Gabriel Nicholas, and Salomé Viljoen, which considers in much greater depth the lessons social media can draw from the clinical trial sector’s legal and technical approaches to sharing data with researchers. For a copy of the latest draft of that article, contact the authors at [email protected], [email protected], or [email protected]. The article is cited here as “Morten et al., forthcoming”.