On Managing Risk in Machine Learning Projects
Written by CDT summer intern Galen Harrison.
The white paper “Beyond Explainability,” published by the Future of Privacy Forum (FPF) and Immuta (a startup that provides streamlined data management services), attempts to sketch out how an organization can manage risk in a machine learning (ML) project. The authors propose a three-tiered oversight model composed of database administrators, data scientists (both as implementers and auditors), and domain and governance experts. These parties are divided into three groups: one that does the actual implementation, one that audits and validates the implementing team’s work, and one that determines what criteria the auditing team should look for.
Some quick background on machine learning: machine learning is the practice of using data to automatically solve computational problems. For example, the post office uses a form of machine learning to turn handwritten addresses into machine-readable ones. A solution to a machine learning problem is called a model, and one produces a model by training it on data. There must be a connection between the data you use and the task you want to accomplish; to train machines to better recognize handwritten text, for example, the post office used images of handwritten characters. During training, the model tries to learn how to solve the problem by looking at the data; a data scientist or developer must specify how the model should be trained and what its objective is. The format of the model depends on the algorithm the developer chose, and the content of the model depends on the data.
Certain algorithms work better for certain problems, and certain algorithms only work with certain kinds of data. The largest distinction is between supervised and unsupervised methods. In a supervised method, the data must be labelled; that is, the output we want the model to produce must be included with the data, serving as examples for the model to train against. The data set the post office used, for instance, paired each image of a handwritten character with the character it was supposed to represent. In unsupervised methods, the data doesn’t come with the ideal answers attached. In general, supervised methods are more powerful, but labelled data is harder to find.
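The supervised/unsupervised distinction can be made concrete with a few lines of code. This is a toy sketch, not anything from the report: the data points and the nearest-centroid classifier are invented purely for illustration.

```python
# Toy illustration of supervised vs. unsupervised learning.
# All data here is invented for the example.

# Supervised: each point comes with the label we want the model to learn.
labelled_data = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"),
                 ((4.0, 4.1), "B"), ((3.9, 4.2), "B")]

def train_nearest_centroid(data):
    """Learn one centroid per label -- a very simple supervised 'model'."""
    sums, counts = {}, {}
    for (x, y), label in data:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {lbl: (sx / counts[lbl], sy / counts[lbl])
            for lbl, (sx, sy) in sums.items()}

def predict(model, point):
    """Classify a new point by the closest learned centroid."""
    px, py = point
    return min(model, key=lambda lbl: (model[lbl][0] - px) ** 2
                                      + (model[lbl][1] - py) ** 2)

model = train_nearest_centroid(labelled_data)
print(predict(model, (1.1, 1.0)))  # → "A": near the first cluster

# Unsupervised: the same points with the labels stripped away.  A
# clustering algorithm could still find the two groups, but it cannot
# name them "A" or "B" -- there are no example answers to learn from.
unlabelled_data = [pt for pt, _ in labelled_data]
```

The labels are what make the first case supervised: they are the “ideal answers” the model trains against, and removing them is exactly what turns the same data set into an unsupervised problem.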
It’s worth pausing at this point to note that the FPF/Immuta report targets black-box models; that is, models whose format doesn’t allow us to directly observe how they make their decisions. As the authors admit, this assumption may or may not be warranted. Engineers may face a tradeoff between explainability and accuracy when deciding what sort of model or assembly of models to use; that is, even if a black-box model performs better than a less opaque (easier-to-understand) one, the designer may choose the more transparent model anyway.
However, while there are use cases where the “how” of the model won’t be accessible or interpretable even to the person who specified the model, there’s good reason to believe these instances are rarer than the layperson might be led to believe. For example, the authors of this blog post were able to nearly reproduce the accuracy of the “AI Gaydar” paper (which used a neural network, possibly the most opaque type of ML algorithm) with only six yes-or-no questions. In fact, it’s folklore in academia that a great deal of ML applications in industry are actually just statistical regressions (a very basic and interpretable form of model that predates computers). Obviously, a model that you can’t interpret presents some unique challenges, but that doesn’t mean that relatively transparent models won’t also act in undesirable ways.
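To see why a statistical regression counts as interpretable, it helps to fit one by hand. The numbers below are invented for the example; the point is that an ordinary least-squares line is fully described by two coefficients you can read off directly, unlike the millions of weights inside a neural network.

```python
# Ordinary least-squares fit of y = slope * x + intercept, in plain Python.
# The data is invented for the example (roughly y = 2x).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares solution: no iterative training needed.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# The whole "model" is these two interpretable numbers: each unit
# increase in x adds `slope` to the prediction.
print(round(slope, 2), round(intercept, 2))  # → 1.99 0.09
```

Because the entire model is two coefficients with a clear meaning, anyone reviewing it can say exactly how a change in the input changes the output, which is precisely what a black-box model withholds.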
In the template described by the FPF/Immuta white paper, the objectives, assumptions, use cases, and desired and undesired outcomes are specified at the outset of a machine learning project. Once this has been done, three groups are formed. The implementation team is tasked with actually implementing and documenting the project, and the validation team is tasked with reviewing the implementation team’s work to see that they are adhering to their documentation and data quality obligations. Finally, the review team periodically reviews the project’s key assumptions. The report’s authors focus their recommendations on understanding the inputs and outputs of the model – that is, trying to understand the model’s behavior by looking at how it reacts to certain inputs.
Just as the concept of “machine learning” can refer to a broad range of techniques, it can also be used to solve a broad range of problems. ML can be used for things like targeting ads, but it can also be used for things like autonomous vehicle systems or assisting doctors in determining the course of treatment for cancer patients. These applications vary in what the authors of the report call “materiality” – the cost of being wrong. They also vary in the amount of agency devolved to the model. A doctor can override a recommendation they believe to be erroneous, whereas the driver of an autonomous vehicle may not have time to respond to a misclassified pedestrian. The assurance process for an autonomous vehicle system should probably differ from the assurance process for an ad targeting system in ways that go beyond the degree of scrutiny. While the template would adapt quite well to ad targeting, for an autonomous vehicle one would likely want the validation team not just to review, but also to conduct its own testing (in the report’s template, the implementation team is mostly responsible for the practicalities of testing).
Furthermore, the types of unwanted behavior that need to be considered can vary widely, in ways that can’t be entirely accounted for in the report’s monitoring-centric approach. Recent work by a group at the University of Michigan showed that small visual modifications to stop signs can make common computer vision techniques mistake them for other kinds of signs. Beyond active interference, other researchers have determined that for some learning tasks, it is possible to infer membership in the training data given sufficient access to the model. Neither of these concerns would be detected or mitigated through monitoring; they would need to be integrated into the design of the model and tested for prior to release. While live monitoring is a necessary component of mitigating risk, it can’t replace an understanding of the task setting and of what, mathematically at least, the model is doing.
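The membership-inference concern can be illustrated with a deliberately simplified sketch. This is not the attack from the cited research – the “model” and data below are invented – but it shows the core intuition: a model that effectively memorizes its training set gives away membership through its own confidence.

```python
# Toy membership-inference sketch with invented data.  The "model" is a
# 1-nearest-neighbor classifier, which memorizes its training data; an
# attacker who can query it sees that it is suspiciously confident
# (distance zero) on training points and infers membership.

train = [(0.0, 0), (1.0, 0), (5.0, 1), (6.0, 1)]  # (feature, label) pairs

def model_distance(x):
    """Distance from x to the nearest memorized training point.
    A memorizing model is 'perfectly sure' (distance 0) on members."""
    return min(abs(x - tx) for tx, _ in train)

def attacker_says_member(x, threshold=0.25):
    """Guess membership: was the model suspiciously confident on x?"""
    return model_distance(x) < threshold

print(attacker_says_member(5.0))  # training point → True
print(attacker_says_member(3.0))  # never seen    → False
```

The defense can’t come from monitoring outputs after deployment: by the time the model answers queries, the leak already exists. It has to be addressed in the design, for example by limiting how closely the model fits individual training points.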
Having a clear view of the design objectives prior to implementation is important in most engineering disciplines, but when it comes to fairness, it is even more important. The vast disparity between the numbers of men and women who graduate with undergraduate degrees in computer science is notable because of the belief that aptitude in computer science is uniformly distributed across genders. Recent work by Moritz Hardt has suggested that for ML models to be meaningfully fair, it is necessary to explicitly commit to a set of assumptions about how the world works. Beyond any specific approach, making clear, specific assumptions about the world is an integral part of building reliable software. Committing to key assumptions ahead of time will help avoid situations where the implementation team’s view of the world is developed around the model rather than vice versa.
There is, however, a countervailing need to understand and adapt to the ways in which a model is actually produced, used, and understood. Ananny and Crawford have suggested that when trying to understand a system (computational, social, governmental, or organizational), it’s more important to look across the system than into it. That is to say, it is more important to understand the relationships between human and non-human actors in the assemblage than the specific functioning of any particular component. The report’s scheme accounts for this insofar as it asks the review team to periodically review key assumptions. However, the range of techniques the authors propose for understanding the model could be expanded: as it currently stands, their suggestions are limited to quantitative methods centered exclusively on the model. Seeing across the model requires looking qualitatively and quantitatively at both the model and the broader context in which it is embedded. Expanding the suite of tools and the scope of review beyond just the model would permit greater transparency.
The FPF template seems appropriate for most, but not all, ML projects. When considering whether to form a process modeled after this template, practitioners should carefully consider the scope and setting of their ML operations and whether they share the template’s main concerns. A good starting place for parties considering model governance would be to identify where their concerns and use cases correspond with the template’s and where they differ, and to develop a plan for handling the concerns that differ. Implementers of model governance should also consider the broader context of the models’ production and use – focusing exclusively on what the model is doing ignores key aspects of how models may be used and interpreted.