This spring some of the biggest product launches in tech have been around a surprising topic: privacy. At Google I/O, the company announced a long list of new privacy features, including the ability to auto-delete stored data, on-device processing to limit Google’s access to identifiable data, and browser features for limiting third-party data collection. At Facebook’s F8 developer conference, CEO Mark Zuckerberg announced that the company will be “privacy-focused” going forward. Apple also emphasized privacy at its recent launch event.
Do these statements portend a larger sea change in how tech companies operate, or are they primarily a marketing strategy? Either way, companies are responding to sustained public attention on privacy (and privacy failures) over the last year. Instead of an afterthought to cool new features, now privacy is a cool new feature.
Despite the obvious business value of privacy, we still hear it pitted against “innovation” in legislative debates. In March, CDT testified at a congressional hearing framed around the impact of the GDPR and CCPA on innovation and competition. We pushed back on misleading testimony about the GDPR’s compliance burden, noting that the law has spurred some companies to use less invasive advertising models and invest in new data security systems. And one year later, the GDPR has set the template on which other countries, and even U.S. states, are modeling laws to govern how companies use information.
The privacy-versus-innovation framing ignores the potential for legislation to actually accelerate privacy-enhancing technologies (PETs), not only by explicitly encouraging them but also by nudging corporate investment toward new, more data-protective models. Companies and researchers have developed promising and viable techniques for extracting useful insights from data while obscuring identifiable personal information. Sometimes referred to as privacy-preserving data analysis (PPDA) or privacy-preserving machine learning (PPML), these methods have the potential to become widely accessible and are already appearing in the market. One startup, Canopy, aims to deliver personalized recommendations (for articles, songs, podcasts, etc.) without collecting any identifiable personal information. Instead of slowing tech sector growth, privacy protections can stimulate more investment in the development and democratization of PPDA.
On-device Processing and Federated Learning
On-device processing generally refers to methods of analyzing data in which the company never receives information that can be linked to an individual. For example, Canopy explains that it uses people’s behavior and preferences to make content recommendations, but the company receives only an anonymized model of users’ preferences, so it cannot learn what content any individual watched or read. One promising approach to on-device data analysis is called federated learning. In federated learning, many people’s devices each train a shared machine learning model locally and send back only model updates, so the underlying data never has to be collected in a centralized database.
Google developed a federated learning system to improve its keyboard suggestions without learning what any individual user is typing. As the company wrote in a 2017 blog post, “When Gboard [Google keyboard] shows a suggested query, your phone locally stores information about the current context and whether you clicked the suggestion. Federated Learning processes that history on-device to suggest improvements to the next iteration of Gboard’s query suggestion model.”
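The Gboard loop described above can be sketched in miniature. The following is a minimal, hypothetical federated averaging example (it is not Google’s actual system, and the devices, data, and learning rate are invented for illustration): each simulated device fits a one-parameter model on its own private data, and the server only ever sees the averaged model weight, never the raw samples.

```python
# Minimal federated averaging sketch (hypothetical; not Gboard's real code).
# Each "device" holds private (x, y) samples and trains locally; the server
# only averages the returned weights, so raw data never leaves the device.

def local_update(weight, data, lr=0.1, epochs=5):
    """One device trains a tiny model y = w * x on its own data."""
    w = weight
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # gradient of squared error
            w -= lr * grad
    return w

def federated_round(global_w, device_datasets):
    """Server broadcasts the global weight, then averages local results."""
    updates = [local_update(global_w, d) for d in device_datasets]
    return sum(updates) / len(updates)

# Three devices, each holding private samples drawn from y = 3x
devices = [
    [(1, 3), (2, 6)],
    [(2, 6), (3, 9)],
    [(1, 3), (3, 9)],
]
w = 0.0
for _ in range(20):
    w = federated_round(w, devices)
print(round(w, 2))  # converges toward 3.0
```

The key property is in `federated_round`: the server’s only inputs are model weights, so the aggregation step learns the shared pattern without observing any device’s data.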
Secure Multiparty Computation (SMC)
SMC methods allow multiple parties to jointly compute a meaningful result over their combined data without any party revealing its own inputs to the others; the computation runs on encrypted or secret-shared values that are never exposed in the clear. SMC was famously used to optimize trading in a Danish sugar beet auction without revealing the amounts that each beet farmer was willing to sell for or able to produce.
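A toy illustration of one common SMC building block, additive secret sharing, can make this concrete. In this hedged sketch (the bid values are invented), three parties want the total of their private bids: each party splits its value into random shares that sum to it, distributes the shares, and only the combined partial sums reveal the total.

```python
import random

# Toy additive secret sharing, a building block of SMC.
# Assumes three honest parties who want only the sum of their private bids.
MOD = 2**31 - 1  # all arithmetic is done modulo a large number

def share(secret, n_parties):
    """Split a secret into n random shares that sum to it (mod MOD)."""
    shares = [random.randrange(MOD) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MOD)
    return shares

bids = [120, 340, 95]  # each party's private value (illustrative numbers)
n = len(bids)

# Every party splits its bid and sends one share to each other party...
all_shares = [share(b, n) for b in bids]
# ...each party sums the shares it received (one column of the matrix)...
partial_sums = [sum(col) % MOD for col in zip(*all_shares)]
# ...and only combining the partial sums reveals the total.
total = sum(partial_sums) % MOD
print(total)  # 555, with no party ever seeing another's individual bid
```

Each individual share is uniformly random, so a party holding any single share learns nothing about the underlying bid; real SMC protocols build richer computations (comparisons, auctions) from primitives like this.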
Data Perturbation and Differential Privacy
Another technique for processing data while obscuring personal information is to manipulate (or perturb) a dataset so that it cannot be used to learn information about any individual. A simple illustration of perturbation in a voter poll is to ask each respondent to flip a coin out of sight of the pollster. If the coin lands on heads, the respondent gives the true answer of whom they plan to vote for. If it lands on tails, the respondent flips a second coin and answers candidate A if it lands on heads or candidate B if it lands on tails, regardless of whom they actually plan to vote for. This introduces some “noise” into the final dataset: for any individual, the pollster cannot tell whether the answer is true or the result of a coin flip. In this example, the dataset is expected to contain the true answer for about 75% of respondents (the half who answer truthfully, plus roughly half of the rest whose random answer happens to match their true intention), and the parameters can be adjusted to introduce more or less noise. If the right amount of noise is introduced, the dataset can still provide valuable election projections while preventing anyone from knowing whom any individual intends to vote for. More sophisticated algorithms have been developed to calculate the amount of noise required for a particular dataset to be what’s known as “differentially private”—able to provide useful insights without allowing personal information to be reconstructed.
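The coin-flip poll described above can be simulated directly. This is a hypothetical sketch of that randomized-response scheme (the electorate and vote shares are invented), including how the pollster can de-bias the noisy aggregate to recover a good population estimate:

```python
import random

# Simulation of the coin-flip poll (randomized response).
# With fair coins, each reported answer is truthful with probability 3/4.

def randomized_response(true_answer):
    """Report the true answer on heads; otherwise report a random one."""
    if random.random() < 0.5:          # first coin: heads -> tell the truth
        return true_answer
    return random.choice(["A", "B"])   # tails -> second coin picks at random

random.seed(0)
true_votes = ["A"] * 6000 + ["B"] * 4000  # hypothetical electorate: 60% for A
reported = [randomized_response(v) for v in true_votes]

# No individual answer is trustworthy, but the aggregate is recoverable:
# P(report A) = 0.5 * p_true + 0.25, so p_true = 2 * p_obs - 0.5
p_obs = reported.count("A") / len(reported)
estimate = 2 * p_obs - 0.5
print(round(estimate, 2))  # close to the true 0.6
```

Because the pollster knows exactly how much noise the coin flips inject, the bias can be subtracted out at the population level while every individual respondent retains plausible deniability.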
PPDA is not a panacea for preventing informational harms. It doesn’t prevent the results or use of data analysis from creating discriminatory or unfair outcomes. If not done correctly, it can be just as vulnerable as any other de-identification or anonymization method. It can require increased computing power and more sophisticated model design. But PPDA is a promising and fast-growing area of privacy innovation that should not be overlooked. It can play an important role in bringing about Zuckerberg’s pronouncement that “the future is private.” And it shows that investment in technological innovation doesn’t have to slow down in response to data protections.