Ethically Scraping and Accessing Data: Governments Desperately Seeking Data

May 3, 2018 / Joseph Jerome

As cities get smarter, their appetite for and access to information are also increasing. The rise of data-generating technologies has given government agencies unprecedented opportunities to harness useful, real-time information about citizens. But governments often lack dedicated expertise and resources to collect, analyze, and ultimately turn such data into actionable information, and so have turned to private-sector companies and academic researchers to get at this information. Municipalities and government bodies have a responsibility to be thoughtful about how they deploy technologies and collect and protect sensitive information, but the rules around how and when governments can access privately held data, and the privacy and security obligations that are conferred with that access, are murky.

One good example of this tension occurred last year when the San Francisco County Transportation Authority (SFCTA) engaged a group of academic researchers at Northeastern University to investigate Lyft and Uber activity in the city. This occurred after months of negotiation with the companies that broke down over the type of sensitive data being requested by SFCTA and questions about what privacy and security protections the city would put in place around the data. Eventually, SFCTA felt that ride-hailing companies were not forthcoming with adequate data and decided to work with the researchers instead.

On the surface, the SFCTA’s rationale for requesting Lyft and Uber data seems reasonable; the agency was seeking more information about the citywide impact of ride-hailing operations. But the methods used by the researchers raise ethical questions about what data collection practices should be permissible in the name of the public interest.

Northeastern’s researchers used a combination of fake accounts and scripts to scrape detailed, raw trip information from Lyft and Uber. Essentially, the researchers mimicked the apps via each platform’s public-facing Application Programming Interface (API). With this information in hand, the SFCTA was able to learn about vehicle movements around San Francisco in real time as well as gather location information about pick-ups and drop-offs around town. The SFCTA has said this entire exercise was above board, emphasizing that all of the information was collected through the open API, thereby making all the information collected “publicly accessible.” As Woody Hartzog has argued, just because information is accessible does not make it public. The fact that an API is accessible, too, does not mean that researchers should have free rein to abuse the terms on which access is provided.

The SFCTA appears to be suggesting that using “public-facing” APIs to access and acquire large amounts of data is fundamentally the same thing as web scraping publicly accessible websites, but it is not. While scraping goes hand-in-hand with the very concept of an open web, using a company-provided API to access and “scrape” information raises different considerations. There is an implied social contract in web scraping; information available on a webpage is ostensibly available to anyone who can view the site. However, as we have seen with Aleksandr Kogan’s acquisition of data from Facebook’s developer API, APIs function as a conduit to potentially huge quantities of information stored by a web service or platform. To unpack this a bit, it’s helpful to consider how information can be accessed by third parties:

Scraping: Using a browser or web tool to download page contents and pull out (or “scrape”) individual data elements of interest.
Scanning: Using a technical process over and over again across the entire internet (or a subset of the entire internet) to interact with devices one by one to assess information or characteristics of those devices. This can be web (port 80/443) but can also be specific to other internet protocols such as the IP protocol, VPN protocols, or other kinds of transport protocols (e.g., TCP/UDP).
Interaction via API: Using the underlying software communications components of an app or application to interact with an internet service to get information and perform functions.
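
The difference between scraping and API interaction can be made concrete with a small sketch. The Python example below is illustrative only: the HTML snippet, the JSON payload, and every field name in it are invented for this example and do not reflect Lyft's or Uber's actual pages or APIs. It works on in-memory strings and makes no network requests.

```python
import json
from html.parser import HTMLParser

# Hypothetical page content, as a browser would receive it.
SAMPLE_HTML = """
<ul>
  <li class="fare">Downtown to Airport: $32</li>
  <li class="fare">Mission to Marina: $14</li>
</ul>
"""

# Hypothetical API response body, as an app would receive it.
SAMPLE_API_RESPONSE = (
    '{"nearby_vehicles": ['
    '{"id": "a1", "lat": 37.77, "lon": -122.42}, '
    '{"id": "b2", "lat": 37.78, "lon": -122.41}]}'
)


class FareScraper(HTMLParser):
    """Scraping: pull text out of markup that was written for human readers."""

    def __init__(self):
        super().__init__()
        self._in_fare = False
        self.fares = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "li" and ("class", "fare") in attrs:
            self._in_fare = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_fare = False

    def handle_data(self, data):
        if self._in_fare and data.strip():
            self.fares.append(data.strip())


def scrape_fares(html):
    """Return the displayed fare strings found in the page."""
    parser = FareScraper()
    parser.feed(html)
    return parser.fares


def read_vehicle_positions(api_body):
    """API interaction: read structured records meant for software clients."""
    payload = json.loads(api_body)
    return [(v["id"], v["lat"], v["lon"]) for v in payload["nearby_vehicles"]]


if __name__ == "__main__":
    print(scrape_fares(SAMPLE_HTML))
    print(read_vehicle_positions(SAMPLE_API_RESPONSE))
```

The scraper only sees what the page chooses to display, while the API client receives machine-readable records directly, which is one reason the two raise different considerations.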

As touched on above, at the behest of the SFCTA, Northeastern researchers mimicked the Lyft and Uber apps through the API and then pulled information about Lyft and Uber vehicle locations from that data store. Though this is normal interaction with an API, it went against the privacy expectations of Lyft and Uber and their users. SFCTA appears to dismiss these concerns, stating that it did not receive any “personally identifiable” data. While that appears to be true, this argument is disingenuous and ignores some of the harder, ethical questions at stake. It also effectively offloads responsibility for protecting and securing data onto researchers who lack the same responsibilities and duties to the public as government officials.

The data access activities undertaken by the Northeastern researchers also were clear violations of both Lyft’s and Uber’s terms of service, and had something gone wrong here, it is unclear where responsibility (or liability) for the research would fall: the researchers, SFCTA public officials, or perhaps even the ride-hailing platforms for failure to provide reasonable data security. Facebook’s data sharing arrangement with academic researcher Aleksandr Kogan, and his subsequent work with Cambridge Analytica, is another high-profile example of how vagueness around liability in law and policy perpetuates privacy violations.

There are a variety of factors to consider when it comes to balancing the public interest against legal terms and conditions. Researcher Ben Edelman has looked at potential discrimination on platforms like Airbnb and called for research exceptions in commercial terms of service, what he terms “bona fide testing.” Others have proposed more formalized “data collaboratives” where organizations can work directly with private companies, particularly social media platforms, to use data responsibly and for the public good. Though challenging, another approach would be to facilitate the establishment of trusted intermediaries that are backed by appropriate legal and technical safeguards.

Openness is crucial to the health of the internet, and companies must find a way to allow access and scraping without punishing good actors. Throwing the legal book at any third party for illicitly acquiring information, for example, creates a cat-and-mouse game between companies and researchers that incentivizes opacity and secrecy. Researchers must also be cognizant of the fact that scraping activities, fairly or not, can run afoul of numerous state and federal legal doctrines, including breach of contract, trespass, and copyright, though claims brought under the Computer Fraud and Abuse Act (CFAA) may appear more suspect.

Before ideas like research exceptions or formalized public-private partnerships can come to fruition, it is necessary to determine what the proper roles and responsibilities of government actors interested in this research should be. CDT provided such a roadmap in its 2015 whitepaper, Data in the On-Demand Economy, which looked at how longstanding Fair Information Practice Principles (FIPPs) could govern municipal data demands from technology companies disrupting local transit, utilities, and healthcare. At the time, we argued that local policymakers needed to ask the following questions:

Legal Basis of Collection: Under what authority is data collected? It is unclear what mandate provided SFCTA with authority to collect Lyft and Uber’s information. In California, platforms like Lyft and Uber are regulated by the California Public Utilities Commission (CPUC), and there is an ongoing rulemaking at the CPUC regarding commercial information collection and sharing procedures. It is unclear why SFCTA circumvented this process and direct negotiations with the ride-hailing platforms to acquire data from a separate third party.
Duration of Access: How long are companies required to provide data to agencies? How long is data retained? SFCTA has bemoaned that its collaboration with the Northeastern researchers was limited to only a “snapshot” in time, and its public FAQ suggests that the academic researchers’ work “precluded ongoing monitoring in San Francisco.” This obfuscates the fact that Lyft and Uber worked to limit the functionality of their APIs as a result of this data grab. The tension point here is that the SFCTA is advocating before the CPUC for ongoing, real-time access to information about ride-hailing trips, including precise location information and complete ride telemetry.
Scope of Collection: What categories of data, including sensitive data, are transmitted? Is the data de-identified to any extent? The SFCTA insists that all data it collected were “aggregated and averaged, and contain no personally identifiable information,” but it is unclear if this is simply in reference to the SFCTA’s public portal and the information it has made available. It is clear, however, that the Northeastern researchers came into possession of location information.
Security of Transmission: Is the data being transmitted in a secure way, using encryption technologies? It is unclear what security precautions were or have been taken by the SFCTA and Northeastern teams to keep the data secure, both in transit and at rest. We believe any sort of data project, whether led by government or an independent research team, should entail a clear set of security obligations that are agreed to by both parties and made public.
Secondary Uses: Is the data collected used for other purposes besides its primary purpose? While the SFCTA insists that this data is being used to understand city traffic congestion and characteristics of the ride-hailing markets in San Francisco, it has also stated that this project will “[s]erve as the foundation of future research.” CDT has emphasized repeatedly the need for government data projects to be clearly scoped, and government data projects must be careful to avoid “mission creep.”
Transparency: How do governments and companies let individuals know about the frequency, type, and nature of data requests? Do individuals have access to this data? The SFCTA’s public portal and an FAQ provide some details about this project. While useful, the public also deserves to know more about the procedures and rules governing the collaboration between the SFCTA and the Northeastern researchers.
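
What “aggregated and averaged” means in practice matters for the scope question above. One way such aggregation could work is sketched below in Python, purely as an illustration and not as a description of what the SFCTA or the Northeastern researchers actually did: raw points are bucketed into coarse grid cells, and only cells meeting a minimum count are kept. The function name, cell size, threshold, and sample coordinates are all assumptions made for this example.

```python
import math
from collections import Counter


def aggregate_pickups(points, cell_size_deg=0.01, min_count=3):
    """Bucket (lat, lon) points into grid cells and drop sparse cells.

    cell_size_deg and min_count are illustrative defaults, not values
    taken from any real project.
    """
    counts = Counter(
        (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))
        for lat, lon in points
    )
    # Suppress cells with too few points, since sparse cells are the
    # ones most likely to point back to an individual trip.
    return {cell: n for cell, n in counts.items() if n >= min_count}


# Hypothetical pick-up coordinates: three close together, one isolated.
SAMPLE_POINTS = [
    (37.7712, -122.4193),
    (37.7745, -122.4151),
    (37.7768, -122.4128),
    (37.8024, -122.4058),
]

if __name__ == "__main__":
    print(aggregate_pickups(SAMPLE_POINTS))
```

The published output contains only cell-level counts, but whoever runs the aggregation still holds the raw points, which is why the question of what the researchers themselves possessed remains relevant.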

This episode, and Facebook’s recent struggles, suggest that revisions and updates to our initial slate of recommendations are warranted. As CDT develops these recommendations, several factors should be considered including:

Clear and standardized rules for how data is shared and accessed in the public interest. Many of these partnerships and methods of acquiring data are ad hoc. While platforms like Lyft and Uber have begun to formalize how they share information, many companies have no clear process for researcher engagement. Researchers should also support calls for best practices in this space that could address their data needs and research prerogatives while recognizing industry’s imperative to better protect data. These efforts are necessary now more than ever as the open data movement has begun to encourage governments to think about the wider privacy impacts of public data releases.
Community review and involvement. Secretive government-sponsored or government-run data collection programs perpetually elicit privacy and ethical concerns. As we have seen in the context of data-driven policing and elsewhere, such programs can result when passionate and empowered public officials sign off on projects without full and complete due diligence. Moving forward, smart city initiatives will need to formulate boards, offices, or “ombudsman”-type entities that can review initiatives from a wider perspective and provide some assurance of standardized review. Calls for adapting elements of human subject research Institutional Review Boards (IRB) into data and wider ICT research make sense, but these reviews also need to address government and researcher collaborations. CDT recommends, at the very least, that reviews include a legal and ethical assessment of the alignment between researcher data access methods and a company’s Terms of Service provisions.

As the maze of partnerships among public officials, private companies, academics, and independent researchers becomes more tangled, a clear path out of the status quo may be challenging. On-demand platforms, as they continue to disrupt local economies, continue to be a significant flashpoint. They may present a necessary starting point to create a broadly adopted framework for government access to commercial data for the public interest, and CDT intends to continue this work.
