Should search engines retain a record of search queries? What benefits or harms flow from retaining that data? Should academic researchers be able to get access to "query log" data from search companies? What kinds of research can be done with this data? And -- critically -- what about the privacy of the search engine users?
All of these questions were debated and discussed in a workshop yesterday at the WWW 2007 conference entitled "Query Log Analysis: Social and Technological Challenges." WWW is the leading annual academic conference focused on the Web and the Internet. This year the conference is in Banff in the Canadian Rockies (making staying indoors for the sessions quite a challenge).
The Query Log workshop addressed a fascinating set of issues, the foremost of which is the significant privacy risk raised by the retention (or distribution) of logs of search terms on sites such as Google, MSN, Yahoo, and Ask. As the WWW event is an academic conference, there was much attention to the plight of researchers outside of the search companies. Researchers are frustrated that they have little or no access to real data, meaning the actual queries entered into search engines.
The companies are hesitant to disclose search data, both out of concern about compromising trade secrets about how they execute and track searches and because of the backlash over the August 2006 incident in which AOL released millions of search terms from about 650,000 users. Although AOL replaced user IDs with pseudonyms, it was relatively easy to identify some individual people from their search terms. There was, appropriately, a huge uproar about the harm to privacy, and AOL quickly took the data down.
Although the release of the data was clearly a mistake, AOL's intentions were in fact honorable - AOL was trying to allow academic researchers access to actual search data. And ironically, the AOL data release did allow researchers to analyze core issues about privacy. In that data, for example, were social security and credit card numbers (raising privacy concerns by themselves), and researchers were able to document how privacy could be breached by aggregating an individual's searches.
My presentation at the workshop focused on the critical need to protect privacy, both internally within search companies and if the data is ever provided to researchers. On one hand, it would be better if query logs were never maintained at all, but it is clear that analyzing search data can help improve search results (both on an individual level and in the aggregate), and so there are good reasons to preserve some data for some time. But if any data is preserved, we must figure out how to protect the privacy of users.
One very interesting study discussed at the workshop -- "User 4XXXXX9: Anonymizing Query Logs," by Eytan Adar -- analyzed the AOL data and concluded that out of tens of millions of search queries, a great deal of the privacy-threatening information was contained in searches that occurred only once or at most twice within the search database. Adar suggested that if we discard or mask those unique or near-unique queries, we can significantly reduce the privacy risk raised by the database of queries.
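To make the idea of masking rare queries concrete, here is a minimal sketch in Python. It is not Adar's actual method, only a simple frequency threshold under stated assumptions: the log is a list of (user ID, query) pairs, queries are compared after lowercasing and collapsing whitespace, and the function name `mask_rare_queries`, the `min_count` parameter, and the `<masked>` placeholder are all hypothetical names chosen for illustration.

```python
from collections import Counter


def mask_rare_queries(log, min_count=3, placeholder="<masked>"):
    """Mask queries that appear fewer than min_count times in the whole log.

    log is assumed to be a list of (user_id, query) pairs. With the default
    min_count of 3, queries seen only once or twice are replaced by the
    placeholder, while more common queries are kept.
    """
    # Normalize so trivial differences in case or spacing do not make
    # otherwise identical queries look unique (an assumption of this sketch).
    normalized = [(user, " ".join(query.lower().split())) for user, query in log]

    # Count how often each normalized query occurs across all users.
    counts = Counter(query for _, query in normalized)

    # Keep frequent queries; replace rare ones with the placeholder.
    return [
        (user, query if counts[query] >= min_count else placeholder)
        for user, query in normalized
    ]
```

A threshold like this only addresses one risk. Common queries can still reveal something about a person when many of them are linked to the same pseudonymous user ID, so a sketch of this kind would be one layer among several rather than a complete safeguard.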
In the discussion at the workshop, some argued that we no longer have privacy, and there is no longer any consensus to protect privacy. Part of my response is contained in my slides - I think it is very significant that in every single one of the 50 U.S. states, there are laws protecting the privacy of library records. Libraries are, I think, direct precursors to today's search engines, and there seems to be broad societal consensus to protect records of our offline search for information.
The Arkansas law is particularly illuminating (at Section 13-2-701 of the state code). The law makes it a misdemeanor to disclose "confidential library records," defined as:
"documents or information in any format retained in a library that identify a patron as having requested, used, or obtained specific materials, including, but not limited to, circulation of library books, materials, computer database searches, interlibrary loan transactions, reference queries, patent searches, requests for photocopies of library materials, title reserve requests, or the use of audiovisual materials, films, or records."
It seems to me that this law strongly suggests that we should be protecting privacy in search queries. How to do that will be an important continuing conversation.