Netflix recently announced the winners of the one-million-dollar “Netflix Prize” and its plans for a new competition, creatively dubbed “Netflix Prize 2.” Although details of this second contest have not yet been made public, the New York Times has reported that competitors will be challenged to “model individuals’ ‘taste profiles’,” based on a dataset that will hold “demographic and behavioral data,” including members’ ages, gender, ZIP codes, and genre preferences, as well as their rental histories and movie ratings.
The announcement of Netflix Prize 2 illustrates how companies continue to wear blinders about how easily individuals listed in supposedly anonymized datasets can be identified. The ostensibly anonymized dataset released to the public for the first Netflix Prize was limited to the video rental histories of 480,000 Netflix subscribers, the ratings (1-5 stars or “no rating”) that subscribers gave each movie, and the date of each rating.
As law professor and CDT Academic Fellow Paul Ohm has pointed out, between 2006 and 2008, researchers Arvind Narayanan and Vitaly Shmatikov showed that if you know just a little about a friend’s (or enemy’s) movie-viewing habits, you can use the Netflix Prize database to likely uncover every movie that person ordered through Netflix and how she rated it. Add demographic and behavioral data to the mix, and how can anyone take seriously claims that the data has been and will remain de-identified?
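To make the idea concrete, here is a minimal sketch of this kind of linkage in Python. It is not Narayanan and Shmatikov’s actual algorithm; it is a deliberately simplified stand-in that counts how many approximately known ratings agree with each released record. The dataset, the subscriber IDs, the function names (`match_score`, `best_match`), and the tolerance values are all invented for illustration.

```python
from datetime import date

# Toy "anonymized" release: subscriber ID -> {movie: (stars, rating date)}.
# Every record here is invented for illustration.
released = {
    "user_0017": {
        "Movie A": (5, date(2005, 3, 1)),
        "Movie B": (2, date(2005, 3, 9)),
        "Movie C": (4, date(2005, 6, 2)),
        "Movie D": (1, date(2005, 7, 15)),
    },
    "user_0482": {
        "Movie A": (4, date(2005, 9, 20)),
        "Movie E": (3, date(2005, 10, 1)),
        "Movie C": (4, date(2004, 1, 10)),
    },
    "user_0931": {
        "Movie B": (5, date(2005, 3, 10)),
        "Movie F": (2, date(2005, 4, 4)),
    },
}

def match_score(aux, record, star_tol=1, day_tol=14):
    """Count auxiliary observations that approximately agree with a record.

    The tolerances are assumptions: outside knowledge about a friend's
    ratings and dates is usually imprecise, so near misses still count.
    """
    score = 0
    for movie, (stars, when) in aux.items():
        if movie not in record:
            continue
        rec_stars, rec_when = record[movie]
        close_stars = abs(rec_stars - stars) <= star_tol
        close_date = abs((rec_when - when).days) <= day_tol
        if close_stars and close_date:
            score += 1
    return score

def best_match(aux, dataset, min_gap=2):
    """Return the top-scoring ID only if it clearly beats the runner-up."""
    ranked = sorted(
        ((match_score(aux, rec), uid) for uid, rec in dataset.items()),
        reverse=True,
    )
    top_score, top_uid = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0
    return top_uid if top_score - runner_up >= min_gap else None

# What an acquaintance might roughly remember about three movies.
aux = {
    "Movie A": (5, date(2005, 3, 3)),
    "Movie B": (3, date(2005, 3, 5)),
    "Movie C": (4, date(2005, 6, 10)),
}
candidate = best_match(aux, released)
```

With these toy records, three roughly remembered ratings are enough to single out one subscriber, at which point the rest of that subscriber’s history (here, the rating for "Movie D") is exposed. A single remembered rating does not clear the `min_gap` threshold, so the sketch declines to name anyone.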
Netflix is far from the only company to release easily de-anonymized data under the pretense that the contents cannot be tied to individuals. From the user identifications that came out of AOL’s infamous data release to Latanya Sweeney’s use of Massachusetts residents’ ZIP code, birth date, and gender – all found in public voter rolls – to identify individuals whose “anonymized” hospital records had been publicly released, the pretense of anonymization is time and again revealed as false. Companies need to own up to the privacy risks inherent in releasing such data and find more comprehensive and robust methods for truly de-identifying their data and protecting their customers.
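The ZIP code, birth date, and gender linkage described above amounts to a join on shared fields. The following Python sketch shows the general shape of such a join under stated assumptions: the records, names, field names, and the `reidentify` helper are all invented, and it is not a reconstruction of Sweeney’s actual procedure or data.

```python
from collections import defaultdict

# Toy "anonymized" hospital rows: names removed, quasi-identifiers kept.
# All values are invented for illustration.
hospital = [
    {"zip": "02100", "birth_date": "1960-01-15", "gender": "F", "diagnosis": "condition A"},
    {"zip": "02100", "birth_date": "1972-09-03", "gender": "M", "diagnosis": "condition B"},
    {"zip": "02200", "birth_date": "1985-05-20", "gender": "F", "diagnosis": "condition C"},
]

# Toy public voter roll: the same three fields, plus a name.
voters = [
    {"name": "Voter One", "zip": "02100", "birth_date": "1960-01-15", "gender": "F"},
    {"name": "Voter Two", "zip": "02100", "birth_date": "1972-09-03", "gender": "M"},
    {"name": "Voter Three", "zip": "02100", "birth_date": "1972-09-03", "gender": "M"},
    {"name": "Voter Four", "zip": "02300", "birth_date": "1985-05-20", "gender": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "gender")

def reidentify(anon_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Map anonymized row index -> name when the key combination is unique."""
    index = defaultdict(list)
    for row in public_rows:
        index[tuple(row[k] for k in keys)].append(row["name"])

    matches = {}
    for i, row in enumerate(anon_rows):
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # exactly one public record shares all three fields
            matches[i] = names[0]
    return matches

linked = reidentify(hospital, voters)
```

In this toy data only the first hospital row is re-identified: the second shares its three fields with two voters, and the third has no counterpart in the roll. The point of the sketch is that removing names does nothing to prevent the first case.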