De-Identification Should Be Relevant to a Privacy Law, But Not an Automatic Get-Out-of-Jail-Free Card

The most important definition in any privacy law is the scope of information that is covered by that law. A line must be drawn somewhere between personal and non-personal data, the argument goes, or else laws will capture all information even if it presents no risks to an individual’s privacy. This oversimplifies how data is collected and processed, but it helps to explain why many stakeholders recommend exempting de-identified data, which includes anonymized, pseudonymized, and aggregated information, from the scope of privacy legislation. However, completely exempting these types of data is not just untenable; it is dangerous.

De-identification techniques vary in effectiveness, and in some cases they fail to hide individual identities.

While legal definitions attempt to present personal data and de-identified data as a binary, the practical reality is that the identifiability of information depends on what universe of data is available and how that data has been manipulated or siloed. De-identification is the use of technical and administrative processes to prevent an individual’s identity from being connected with other information. However, there is no bona fide standard for de-identifying personal information. Simply removing a name or ID number from a data set may not ensure that it is de-identified. Anonymous search queries, social network data, and geolocation data all permit re-identification, and we have seen time and time again how information that was claimed to be “anonymous” was easily re-identifiable.

  • Location data has proven especially difficult to de-identify. New York City officials, for example, accidentally revealed the detailed comings and goings of individual taxi drivers in a poorly de-identified public data release, and just a handful of random location data points is enough to uniquely identify a person 95% of the time.
  • Medical records have also proven difficult to de-identify. In 2016, the Australian government released an anonymized dataset of medical billing records, including prescriptions and surgeries. Researchers quickly noted “the surprising ease with which de-identification can fail” when additional datasets are cross-referenced.
  • Metadata can also be used to quickly identify individuals. Looking at 200 tweets, researchers were able to use associated metadata like timestamps, number of followers, and account creation time to identify anyone in a group of 10,000 Twitter users 96.7% of the time. Even when the metadata was deliberately muddled, a single person could still be identified with more than 95% accuracy.
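
To make the linkage problem concrete, here is a minimal, purely illustrative Python sketch. All names, fields, and values are hypothetical; it simply shows how a record with the name stripped out can be re-identified by joining the remaining quasi-identifiers against a separate, identified dataset.

    # Hypothetical sketch: re-identification by linking quasi-identifiers.
    # "De-identified" records with names and IDs removed.
    deidentified_records = [
        {"zip": "98103", "birth_date": "1987-04-12", "sex": "F", "diagnosis": "asthma"},
        {"zip": "98103", "birth_date": "1962-09-30", "sex": "M", "diagnosis": "diabetes"},
    ]

    # A separate, identified dataset (e.g., a public roll) sharing the same attributes.
    public_records = [
        {"name": "Jane Doe",  "zip": "98103", "birth_date": "1987-04-12", "sex": "F"},
        {"name": "John Smith", "zip": "98103", "birth_date": "1962-09-30", "sex": "M"},
    ]

    QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

    def link(record, reference):
        """Return every identified person whose quasi-identifiers match the record."""
        key = tuple(record[q] for q in QUASI_IDENTIFIERS)
        return [p["name"] for p in reference
                if tuple(p[q] for q in QUASI_IDENTIFIERS) == key]

    for record in deidentified_records:
        matches = link(record, public_records)
        if len(matches) == 1:  # a unique match re-identifies the record outright
            print(matches[0], "->", record["diagnosis"])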

The problem is that, as businesses, researchers, and governments collect and disclose more and more data, any given dataset becomes high-dimensional. High-dimensional datasets, which contain dozens of data points about each individual, are resistant to basic de-identification methods. This type of data becomes unavoidable in a world with ubiquitous connectivity and rampant tracking across different devices and contexts. In other words, it can be technically impossible to de-identify data, and the risk when de-identification fails is significant.
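
As a rough illustration of that uniqueness effect, the short synthetic simulation below (the population size, attribute counts, and value ranges are arbitrary assumptions, not drawn from any real dataset) shows how quickly records become unique fingerprints as more attributes are recorded about each person.

    # Synthetic sketch: the more attributes a dataset records per person,
    # the more likely each record is to be a unique "fingerprint."
    import random
    from collections import Counter

    random.seed(0)
    population = 100_000

    for num_attributes in (3, 6, 12, 24):
        # Each attribute independently takes one of four possible values.
        records = [
            tuple(random.randrange(4) for _ in range(num_attributes))
            for _ in range(population)
        ]
        counts = Counter(records)
        unique = sum(1 for record in records if counts[record] == 1)
        print(num_attributes, "attributes:",
              f"{unique / population:.1%} of records are unique")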

Even if de-identification can protect the privacy of individuals, it does not always prevent harms to groups of people.

The potential misuse of aggregated information should not be underestimated. Infamously, Strava, a fitness data platform, created an “aggregated” heat map that revealed secret information about the location and movements of military service members in conflict zones, including the alleged locations of secret U.S. military installations. While no individual service member was identified, the harm of this revelation was obviously severe. This highlights some of the larger ethical issues that emerge with open data and public data sharing by default.
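
A minimal, entirely synthetic sketch of that failure mode (the cell labels, counts, and threshold below are invented for illustration): publishing aggregate counts protects people in busy places, but a map cell that is active only because a handful of people exercise there still discloses where that small group is, unless low-count cells are suppressed.

    # Synthetic sketch: aggregation without a minimum-count threshold can still
    # reveal a small group's location.
    from collections import Counter

    # Activity events, already bucketed into coarse map cells (labels are invented).
    activity_cells = ["downtown_cell"] * 5_000 + ["remote_cell"] * 12

    SUPPRESSION_THRESHOLD = 25  # hypothetical minimum count before publishing a cell

    heat_map = Counter(activity_cells)
    for cell, count in sorted(heat_map.items()):
        # Publishing the raw count for "remote_cell" flags exactly where the
        # small group is; suppressing low-count cells avoids that disclosure.
        published = count if count >= SUPPRESSION_THRESHOLD else "suppressed"
        print(cell, published)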

An underappreciated issue is that while companies view de-identification as a useful safety valve from privacy regulation, privacy rules are often one method for addressing public concerns about data-driven discrimination or targeting of protected groups. A safe harbor for de-identification may not impact an individual’s privacy per se, but it does nothing to address larger manipulative and discriminatory behaviors.

Privacy law may not have traditionally dealt with these considerations, but their time has come. Industry and lawmakers should not avoid these challenges by cabining them into an unregulated space of de-identified data.

So what is the policy solution? We would make three key recommendations.

First, privacy legislation should not categorically exempt de-identified data from privacy and security requirements. An email address is often pseudonymous, but it should hardly be excluded from privacy protection. Certain privacy rights that apply to easily identifiable personal information, such as names, account information, and other data appended to a single user profile, need not be afforded to de-identified data. However, legislation should generally protect information that is reasonably linkable to a person or a consumer device, provide mechanisms for judging evolving de-identification tactics over time, and set a higher standard for any data shared or made available to the public.

Second, finding the proper balance is difficult to do in legislation alone, and regulatory guidance and rulemaking can be helpful if de-identification is included in any U.S. privacy law. The Health Insurance Portability and Accountability Act (HIPAA) has one of the most advanced de-identification regimes, but it was a product of ongoing regulatory rulemaking. (Broad legislative definitions of de-identified data have been one of our primary concerns with the Washington Privacy Act, which otherwise claims to bring GDPR-type privacy protections to the U.S.) Privacy scholars and computer scientists have offered an array of useful suggestions that recognize, as the Federal Trade Commission does, that “the nature of the data at issue and the purposes for which it will be used are also relevant” considerations with respect to de-identification.

Third, we need a “trust, but verify” solution that incorporates law and policy. Companies should be required to describe their methods for de-identifying personal information as part of an overarching transparency regime that provides meaningful external accountability. In addition, contractual agreements can be useful, particularly restrictions on the flow and use of information. These agreements should go beyond existing practices, however, and require reasonable and affirmative efforts to oversee how third parties and other partners use de-identified information.