The Importance of Algorithmic Fairness

Edward_Dixon · ‎02-12-2021

Algorithmic fairness is a motif that plays throughout our podcast series: as we look to AI to help us make consequential decisions involving people, guests have stressed the risks that the automated systems that we build will encode past injustices and that these decisions may be too opaque. In episode twelve of the Intel on AI podcast, Intel AI Tech Evangelist and host Abigail Hing Wen talks with Alice Xiang, then Head of Fairness, Transparency, and Accountability Research at the Partnership on AI—a nonprofit in Silicon Valley founded by Amazon, Apple, Facebook, Google, IBM, Intel and other partners. (Alice is now AI Ethics lead at Sony.) With a background that includes both law and statistics, Alice’s research has focused on the intersection of AI and the law.

“A lot of the benefit of algorithmic systems, if used well, would be to help us detect problems rather than to help us automate decisions.”

-Alice Xiang

Recognizing Historical Inequalities

Algorithmic fairness is the study of how algorithms might systemically perform better or worse for certain groups of people and the ways in which historical biases or other systemic inequities might be perpetuated by AI. The Partnership on AI speaks with a number of players, from machine learning experts to lawyers, policymakers, and compliance officers, about bias detection and bias mitigation in order to provide the industry with recommendations and best practices. An important part of this field of work is focusing on the role that demographic data has played in society over the years. In episode seven of the podcast, Yeshi Milner discussed how the use of such data has shaped things like credit scoring. While access to capital can have a major impact on the trajectory of a life, when we use AI to make decisions around bail and parole, the stakes are even higher.

Sampling bias is definitely a concern for anyone training a model, but in this domain, disentangling causation is really not easy. Alice notes that in the US there are disproportionately high false positive rates for Black defendants compared to white defendants when using risk assessment tools. Part of that is due to arrest data, where there's a higher baseline arrest rate for Black defendants compared to white defendants. That discrepancy may not simply be due to crime trends across groups—it may also follow historic over-policing of certain communities.
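To make the false positive comparison concrete, here is a minimal sketch of how such a disparity could be measured. The records are made-up toy data, and the function name and the `(group, predicted_high_risk, reoffended)` layout are assumptions for illustration, not the format of any real risk assessment tool.

```python
from collections import defaultdict

def false_positive_rate_by_group(records):
    """Return {group: FPR}, where FPR = flagged-but-did-not-reoffend / all who did not reoffend.

    `records` is an iterable of (group, predicted_high_risk, reoffended) tuples.
    This layout is a hypothetical one chosen for the example.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, predicted_high_risk, reoffended in records:
        if not reoffended:                 # only people who did NOT reoffend can be false positives
            negatives[group] += 1
            if predicted_high_risk:
                false_positives[group] += 1
    return {g: false_positives[g] / n for g, n in negatives.items() if n > 0}

# Toy data, invented for illustration only.
records = [
    ("A", True, False), ("A", True, False), ("A", False, False), ("A", False, False), ("A", True, True),
    ("B", True, False), ("B", False, False), ("B", False, False), ("B", False, False), ("B", True, True),
]

rates = false_positive_rate_by_group(records)
for group, rate in sorted(rates.items()):
    print(f"group {group}: false positive rate = {rate:.2f}")
```

Note that a check like this still inherits the problem described above: if "reoffended" is itself recorded through arrests, the label carries the same policing bias as the training data.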

Given this data bias, it’s important to understand how such risk assessment tools are being used every day. According to Alice, almost every state in America uses some kind of data system at some point in the criminal justice process, either at pre-trial, in sentencing, or in parole-related decisions. For Europeans like myself, the scale of the American justice system is hard to comprehend, with an incarceration rate about six times that of countries in western Europe. (Here in Ireland, we have 76 prisoners per 100,000 of our population versus 655 for the US.)

High Stakes Decisions

I try hard to be ruthlessly empirical, and so the idea of systematizing high-stakes decisions is extremely appealing. I’m not a gut feel person and will absolutely go with the math over what my stomach says. However, the engineer in me also quails at the thought of the responsibility that comes with building such systems. When we put an algorithm into a car’s automatic braking system, into a medical care suite, or onto a judge’s desk, we had better set a very high validation bar.

Alice explains that the proliferation of criminal justice risk assessment tools is, in part, being propelled by proposals to replace cash bail. In the US, as in other common law countries, judges set bail based on the severity of the crime and the financial resources of the defendant. It is abhorrent that a defendant, enjoying the presumption of innocence, should be imprisoned, losing their livelihood and quite possibly the custody of their children and their home. A financial stake (in Ireland, often an independent surety from a friend or relative of the defendant) is intended to balance the interest of the defendant (freedom) with that of society (that the defendant appears for their trial). As in the US, our defendants are often of modest means; bail is set accordingly, but without a requirement to actually produce cash on the spot and without the US system of commercialized bail bonds.

Risk assessment tools are intended to allow defendants to be released “on their own recognizance” (promising to come back for the trial), which means predicting whether they will offend in the meantime. If arrest data is used as a proxy for conviction, then the risk is that poor but innocent defendants, who might previously have been incarcerated due to inability to raise cash, might instead be incarcerated due to a societal bias crystallized as an algorithm.

How to Solve for Data Bias

Health care is another example of data bias in practice that Alice brings up in the podcast. If a system tries to predict people's health care needs based on an algorithm trained only on health care costs, then that system may deprioritize patients that have historically not received the treatment they needed and will create further inequality. Again, this will sound strange to non-US listeners like myself, but in the US, healthcare is received and funded very differently from the “median European” system. So, it is entirely possible that patients will have gone without care due to lack of ability to pay, and an algorithm might “learn” that a given diagnosis needed no further treatment.
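The cost-as-proxy problem can be sketched in a few lines. Everything here is invented for illustration: the patients, the `true_need` and `access` fields, and the assumption that observed cost is simply need scaled by access to care. The point is only that ranking by historical cost and ranking by need can disagree.

```python
# Hypothetical patients: `true_need` is what we would like to predict,
# `access` (0..1) is the share of needed care the patient could actually obtain.
patients = [
    {"id": "p1", "true_need": 8, "access": 1.0},
    {"id": "p2", "true_need": 8, "access": 0.4},  # same need as p1, but went without care
    {"id": "p3", "true_need": 5, "access": 1.0},
]

def observed_cost(patient):
    """Toy assumption: recorded cost reflects only the care that was actually received."""
    return patient["true_need"] * patient["access"] * 1000

# A model trained to predict cost would, at best, reproduce this ordering.
ranked_by_cost = [p["id"] for p in sorted(patients, key=observed_cost, reverse=True)]
ranked_by_need = [p["id"] for p in sorted(patients, key=lambda p: p["true_need"], reverse=True)]

print("priority by historical cost:", ranked_by_cost)
print("priority by actual need:    ", ranked_by_need)
```

In this toy case the under-served patient `p2` drops to the bottom of the cost-based ranking despite having the same need as `p1`, which is the failure mode described above.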

"A map is not the territory it represents, but, if correct, it has a similar structure to the territory, which accounts for its usefulness."

— Alfred Korzybski, Science and Sanity, p. 58.

If you are a developer and you see bias in your algorithm, what next? Alice recommends figuring out what is causing the underlying bias. After all, our data tables come from an imperfect world and “the map is not the territory,” as scholar Alfred Korzybski might say. A personal data hero of mine is Ignaz Semmelweis, a 19th-century Hungarian-born doctor who spotted a worrying discrepancy in the Vienna obstetric hospital where he worked. The overall high mortality figures concealed a three-fold difference in mortality between rich mothers and poor ones. You should most definitely listen to the whole story. I won’t spoil the ending here, but sometimes the answers are simply not in your data and you need to walk the halls of your clinic to make a crucial, life-saving observation.

Data bias is a thorny, thorny problem and there are no easy answers. When we measure the weather, we can calibrate and test our instruments in a way that is far more difficult when our means of measurement is the combined result of a chain of decisions made by police officers, officers of the court, and juries. Here in Ireland, men make up 87% of our prisoners, which could reflect a greater propensity to offend or a greater reluctance to charge, convict, and sentence women. (US studies suggest the answer is a “bit of both.”) One possible route is to apply Bayesian techniques, which, in some circumstances, can uncover causation. This is not something you should do in every country, however. In France, applying statistical analysis of this type to judicial decisions could result in a five-year prison term.

Looking Deeper

Alice suggests modifying the data itself to remove bias. Putting a proverbial thumb on the scale is where things get complicated. Who’s deciding exactly what percentages of certain groups need to be better represented? If I were training a model to predict re-offending among Irish parolees, data on historical arrests and convictions (six times more male prisoners) would likely result in a model that recommended detaining young men far more than young women. Should I then modify the data so that women are detained just as often and for just as long? Without some empirical basis, this would hardly be ethical, or indeed legal.

Bias is also a signal, though! Alice brings up the case of an American healthcare company that was using a model to predict outcomes relating to sepsis. The company found disproportionately worse outcomes for communities where English was a second language. Instead of approaching this bias as merely a technical problem where they needed to use some sort of algorithmic intervention, they were able to look closer at the resources in hospitals where the data was being generated and collected (like my hero Semmelweis!). What they found was essentially a communication oversight: there weren't any Spanish-language materials for sepsis awareness and detection in the hospital. It might seem strange that a data scientist should be the first to spot this problem (or the Semmelweis discovery), but the results were excellent: once Spanish-language materials were introduced to those environments, the gap between different groups started to close.

Finding the Missing Elements

Alice believes that algorithmic bias stems from our parochial natures and the blinkers imposed by preconceived interpretations of the data. Sometimes, you really need someone with a different perspective. (I still think with fondness of a ruthlessly empirical intern we once hired, who tested many of our theories to destruction.) Alice illustrates her point with her background in a small industrial town in Appalachia, where a large portion of the population worked for one major company. In episode four of the podcast, Intel Vice President Sandra Rivera talked about using AI in human resource management for expediency, to mitigate bias, and to hire and retain talented people. Would an AI system developed in a major urban area be useful for company hiring in a rural region?

As Alice points out, we tend to think of these tools as being used for people very much like ourselves; we’re often subject to the “streetlight effect” (looking for our keys where the light is, rather than where we dropped them). If most AI algorithms are being built in Silicon Valley, Alice argues, then the onus is on developers to educate themselves and recognize that they are missing key elements from other demographics, a point Rana el Kaliouby also echoed in episode eleven of the podcast. I’ll add that I believe there is a massive arbitrage opportunity in identifying previously overlooked talent and that whoever figures out how to do this will have a bigger business than LinkedIn. Given the rate of talent inflow to the Bay Area, and the rate at which immigrants found successful US businesses, I’m a long-term optimist on this one.

When Open Source Isn’t Enough

Another issue with current AI systems is that we don't have very good tools for making the innards of algorithmic tools more accessible to a broader audience. When it comes to explainability, making your code open source works very well for traditional software, but not so much for AI, not even if you give away your dataset (if you can give it away; very often a legal impossibility). This opacity is not unique to AI. For example: how many passengers on an Airbus could give a sensible sketch of the conditions under which the flight management system will switch to Alternate Law in the event of a horizontal stabilizer fault? But that is no reason for us to be satisfied with the status quo.

Explainability is far from being a solved problem: if we really understood how our models worked, we probably wouldn’t need to train them. But using the tools that are available now can help not only with building users’ trust in your product, but also with assisting your team as it debugs your models. Looking for further reading? The Partnership on AI publishes their research for anyone to learn from and use.

Being Honest about Imperfect AI

Still, it’s not just making datasets and research public that will help make AI fairer and more accountable; it’s also being honest about the limitations of such tools and systems.

Returning to the criminal justice context, Alice highlights that most tools being used today cannot achieve more than a 60 to 70% accuracy rate. This is a hard number to understand without any context. For example, how does it compare to the predictions of expert humans? But it is clearly a number that we should expose and track to avoid the appearance of “machine infallibility.” I personally have seen users follow machine recommendations blindly when they were described as algorithmic, until subsequent analysis showed that the algorithm was doing worse than chance. Exposing your model’s limitations is part of treating your stakeholders fairly and motivating your team to do better!
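One simple way to give an accuracy figure some context is to compare it with a trivial baseline. The sketch below uses made-up outcomes and predictions, and the helper names are my own; it compares a model's accuracy against always predicting the most common outcome.

```python
from collections import Counter

def accuracy(predictions, outcomes):
    """Fraction of predictions that match the observed outcomes."""
    return sum(p == o for p, o in zip(predictions, outcomes)) / len(outcomes)

def majority_baseline_accuracy(outcomes):
    """Accuracy obtained by always predicting the most common outcome."""
    most_common_count = Counter(outcomes).most_common(1)[0][1]
    return most_common_count / len(outcomes)

# Toy data, invented for illustration: 1 = reoffended, 0 = did not.
outcomes    = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
predictions = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

model_acc = accuracy(predictions, outcomes)
baseline_acc = majority_baseline_accuracy(outcomes)

print(f"model accuracy:    {model_acc:.0%}")
print(f"baseline accuracy: {baseline_acc:.0%}")
if model_acc <= baseline_acc:
    print("The model does no better than always predicting the majority outcome.")
```

In this toy case a model with 60% accuracy loses to a "model" that never flags anyone, which is why the raw number means little on its own. A comparison against expert human predictions, as suggested above, would need real data that this sketch does not have.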

These types of distinctions are paramount. As the saying goes, “the devil is in the details.” So while ending cash bail is something many progressive groups believe would create a more equitable legal system, the Partnership on AI released a report arguing against the use of algorithmic tools for pre-trial decisions, finding itself on the side of groups like Human Rights Watch, but also alongside bail bondsman associations with an economic incentive to keep the current system in place.

Though these AI systems are imperfect, Alice does see them being potentially useful. Ideally, she says, they would be used to inform judges, but only if judges are sufficiently educated about their limitations and how to properly interpret the results, and only if developers mitigate algorithmic bias and tailor the tools to the specific jurisdictions where they are deployed.

The Future of Using AI in Decision Making

Even as more organizations and experts are taking algorithmic fairness seriously, Alice is still worried that we might find ourselves in a situation where the laws and policies applied to algorithms make it impossible for developers to meaningfully mitigate bias in practice. She sees this as a fundamental problem and part of a broader tension in American society between colorblindness and race consciousness when it comes to decision-making. For example, while Irish university admissions are “blind” (a computer checks your exam scores, no interview or essay required), American universities use demographic attributes as part of admission decisions in order to address disproportionate outcomes. (Any visitor to the US will be struck by the disparity in ethnic ratios between, say, janitorial staff and engineering meetings.) Affirmative action jurisprudence is hotly debated and just a Supreme Court decision away from being invalidated; indeed, I’m almost sure it would be illegal in Ireland. Educational disparities are a little less obvious in Ireland, but very present. Here, rather than change college admissions, the remedy has been to improve pre-school and primary education so that all students are better prepared for second level and the national examinations that are the only basis for access to university places.

Alice also notes that education alone won’t solve these issues—diversity trainings aren't a replacement for a diverse team—and points to the watershed Gender Shades project from MIT that proved facial recognition algorithms weren't detecting certain faces properly. Like many other guests on the podcast, Alice believes the technology industry needs to foster a more inclusive environment for diverse AI teams, and with the perennial talent shortage, the market is moving her way.

To learn more about Alice’s work, the paper to read is “Explainable Machine Learning in Deployment.”

To hear more Intel on AI episodes with experts from across the field, find your favorite streaming platform to listen to them at: intel.com/aipodcast

The views and opinions expressed are those of the guests and author and do not necessarily reflect the official policy or position of Intel Corporation.