A crucial aspect of deploying machine learning (ML) models in clinical laboratories is ensuring that they achieve their desired goals without introducing or exacerbating inequity in healthcare delivery. We will examine this risk through the lens of fairness concepts and their metrics during the model validation process. Azimi and Zaydman [1] have provided a more comprehensive overview of the key considerations for laboratory medicine.
Let’s explore how to calculate these metrics using our normal saline predictor in R.
First, we’ll add a set of simulated labels to our validation set.
We’ll make contamination twice as common in females as in males.
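A minimal sketch of this simulation step, assuming the validation set is a data frame named `val` with a `sex` column coded "Female"/"Male"; the data frame name, column names, and per-group contamination rates are illustrative assumptions, and the exact counts shown below will vary with the random seed.

```r
library(dplyr)

set.seed(42)

val <- val %>%
  mutate(
    # Simulate contamination roughly twice as common in females as in males
    contaminated = rbinom(n(), size = 1,
                          prob = ifelse(sex == "Female", 0.0086, 0.0043)),
    contaminated = factor(contaminated, levels = c(1, 0),
                          labels = c("Positive", "Negative"))
  )

# Tabulate the simulated labels by sex
table(val$contaminated, val$sex)
```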
Contaminated Results by Sex

| Contamination | Female | Male   |
|---------------|--------|--------|
| Negative      | 136807 | 137511 |
| Positive      | 1196   | 590    |
Calculating Group-Wise Performance Metrics
Next, we’ll calculate a set of performance metrics, including sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and flag rate, for each sex. Then, we’ll compare these metrics across the demographic groups using the fairness concepts described above.
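A minimal sketch of the group-wise calculation, assuming `val` also contains a `predicted` column with the model’s "Positive"/"Negative" calls; that column name is an illustrative assumption.

```r
library(dplyr)

# Confusion-matrix counts and derived metrics, computed separately for each sex
group_metrics <- val %>%
  group_by(sex) %>%
  summarise(
    tp = sum(predicted == "Positive" & contaminated == "Positive"),
    fp = sum(predicted == "Positive" & contaminated == "Negative"),
    tn = sum(predicted == "Negative" & contaminated == "Negative"),
    fn = sum(predicted == "Negative" & contaminated == "Positive"),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv         = tp / (tp + fp),
    npv         = tn / (tn + fn),
    flag_rate   = (tp + fp) / n()
  )

group_metrics
```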
- Differences in the incidence of contamination across the demographic groups lead to discrepant positive predictive values and poor predictive parity.
- Class imbalance, with positive cases being quite rare, leads to large relative, but small absolute, differences in flag rate (see the comparison sketch after this list).
- Given the random nature of the assigned labels, sensitivity and specificity are nearly identical across groups.
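To make the relative-versus-absolute distinction concrete, one way to tabulate the between-group comparisons is as both an absolute difference and a ratio for each metric. A minimal sketch, assuming the `group_metrics` table from the previous step:

```r
library(dplyr)
library(tidyr)

# One row per metric, with Female and Male values side by side,
# compared as an absolute difference and a ratio
group_metrics %>%
  select(sex, sensitivity, specificity, ppv, npv, flag_rate) %>%
  pivot_longer(-sex, names_to = "metric", values_to = "value") %>%
  pivot_wider(names_from = sex, values_from = value) %>%
  mutate(
    abs_difference = Female - Male,
    ratio          = Female / Male
  )
```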
Key Takeaway:
Performance assessment should incorporate concepts of algorithmic fairness to protect against the introduction or exacerbation of inequity.