Having Your (Big Data) Cake and Eating It Too

Sometimes new research appears to be magical. Or at least, new research can accomplish things that previously appeared to be uncanny, unintuitive, or even impossible.

Such is the case with new research on privacy and big data, called RAPPOR. New results from Úlfar Erlingsson (Google), Vasyl Pihur (Google), and Aleksandra Korolova (USC), to be presented at next month’s Conference on Computer and Communications Security (ACM CCS), a top computer security research venue, demonstrate the feasibility of collecting granular data about users in a privacy-preserving manner.

Essentially, this means a service can collect useful data about individuals without revealing what a specific individual has done. That may sound like magic though, so let’s walk through what they do.

The idea harkens back to survey research methods developed in the 1960s to ask people about sensitive topics and still get honest answers. Respondents in some cases weren’t convinced that survey researchers could keep responses confidential and anonymous. Given this lack of confidence, respondents might be unlikely to honestly answer questions like “Do you currently have a sexually transmitted infection (STI)?”

The solution was simple: when asking a sensitive question, researchers asked the person to flip a coin. If the coin came up heads, the respondent was told to answer Yes, regardless of the true answer. If the coin came up tails, the respondent was told to tell the truth, Yes or No. Effectively, this meant that any given Yes response was deniable by the respondent, preserving the privacy of those who answered Yes.

However, because the researcher knew to expect about half of the respondents to answer Yes purely by chance (a fair coin comes up heads with probability 1/2), she could account for this when counting the total number of Yes responses from the sample. (A survey of 100 people with no STIs would include, on average, 50 Yes responses; so a count of 60 Yes responses indicates that about 10 of the roughly 50 truth-telling respondents have STIs, an estimated rate of about 20%.) This technique allows researchers to estimate the real proportion of Yes answers without being able to identify any particular respondent who answered Yes truthfully.
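To make the arithmetic concrete, here is a minimal Python sketch of the coin-flip survey. The function names, the simulated population of 100,000 respondents, and the 20% true rate are illustrative assumptions, not anything taken from the original surveys or from the RAPPOR paper.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Flip a fair coin: heads means answer Yes, tails means tell the truth."""
    heads = rng.random() < 0.5
    return True if heads else truth


def estimate_true_yes_rate(yes_count: int, total: int) -> float:
    """Undo the coin flip in aggregate.

    Expected Yes share = 1/2 + (1/2) * p, where p is the true Yes rate,
    so p is estimated as (2 * yes_count - total) / total.
    The result is clamped at 0 because noise can push it slightly negative.
    """
    return max(0.0, (2 * yes_count - total) / total)


# The worked example from the text: 60 Yes responses out of 100.
print(estimate_true_yes_rate(60, 100))  # 0.2

# Simulation with a hypothetical population in which 20% would truthfully say Yes.
rng = random.Random(0)
population = [i < 20_000 for i in range(100_000)]
answers = [randomized_response(truth, rng) for truth in population]
simulated_estimate = estimate_true_yes_rate(sum(answers), len(answers))
print(simulated_estimate)  # close to 0.2, though no single answer can be trusted
```

Note that the estimate is only reliable for large samples; with 100 respondents the count of coin-forced Yes answers will itself vary noticeably around 50.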

The RAPPOR research extends this model to reporting analytics about how people use software, like the Chrome browser. For each quantity of interest (for example, “Does this user allow cookies to be set?”), Chrome can record a Yes/No answer and then, at random, report either the truth or a randomly chosen Yes/No. Google, or anyone else who gets access to this data, will not be able to tell whether a given answer is true. Chrome can in fact record a list of many Yes/No answers about how it is being used, but an individual list reveals nothing certain about the actual truth. Only when that list is added together with many more such lists from other browsers does the information become useful: because the randomization is known, it can be accounted for when computing the aggregate statistics.
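The idea of randomizing a whole list of Yes/No answers and recovering only the totals can be sketched as follows. This is a deliberately simplified illustration, not the encoding described in the RAPPOR paper: the 50% replacement probability, the three example settings, their true rates, and all function names are assumptions made for this example.

```python
import random
from typing import List

# Assumed for illustration: chance that an answer is replaced by a fair coin flip.
REPLACE_PROB = 0.5


def randomize_report(truth: List[bool], rng: random.Random,
                     f: float = REPLACE_PROB) -> List[bool]:
    """Independently randomize each Yes/No answer before it leaves the client."""
    report = []
    for answer in truth:
        if rng.random() < f:
            report.append(rng.random() < 0.5)  # random Yes/No
        else:
            report.append(answer)              # the truth
    return report


def estimate_rates(reports: List[List[bool]], f: float = REPLACE_PROB) -> List[float]:
    """Estimate the true Yes rate per question from many randomized reports.

    Expected Yes share = f/2 + (1 - f) * p, so p = (observed - f/2) / (1 - f).
    """
    n = len(reports)
    estimates = []
    for column in zip(*reports):
        observed = sum(column) / n
        estimates.append((observed - f / 2) / (1 - f))
    return estimates


# Hypothetical settings and true rates, e.g. "allows cookies", and two others.
TRUE_RATES = [0.9, 0.3, 0.05]
rng = random.Random(1)
clients = [[rng.random() < rate for rate in TRUE_RATES] for _ in range(50_000)]
reports = [randomize_report(client, rng) for client in clients]
estimated = estimate_rates(reports)
print([round(e, 2) for e in estimated])  # near TRUE_RATES
```

Any single report in `reports` is plausibly deniable, since each answer may be a coin flip, while the per-question estimates over tens of thousands of reports land close to the true rates. The price is added statistical noise, which shrinks as the number of reports grows.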

There are some fascinating implications of this kind of technique. Because the individual records are not sensitive, RAPPOR eliminates some of the tensions we’re concerned with in collecting data. We’ve written in a number of venues about the risks of unconstrained collection of data, including breaches, government surveillance, internal snooping, and unwanted secondary uses. RAPPOR data is effectively breach-proof, since a stolen record reveals nothing certain about the individual, and it is not useful at the individual level for government surveillance, snooping, or secondary uses.

Of course, RAPPOR isn’t useful for applications where an analyst necessarily needs raw individual-level data; RAPPOR isn’t designed for those uses. However, I’ve begun to call RAPPOR a “halfway house to big data”. Instead of collecting all the raw data that could ever possibly be useful and keeping it indefinitely, big data analysts can instead use a RAPPOR-like structure to see if there is “signal” in data sources and then affirmatively choose to collect raw data, in an opt-in manner, avoiding most of the risks groups like CDT have identified in unbridled big data collection.