Assessing Model Applicability


An effective ML pipeline must recognize when predictions should and should NOT be made. Applicability domain assessment refers to the identification of inputs that are too different from the training data for the pipeline to render a reliable prediction. These inputs, termed "out-of-distribution," may arise from measurement error, labeling errors, or other pre-analytical factors.

In their simplest form, applicability checks can be a series of univariate decision boundaries that define the range of values for each feature considered safe for prediction; anything outside that range is labeled "out-of-distribution." Feasibility limits and linearity limits are familiar examples from the clinical laboratory.
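
As a toy illustration, the sketch below flags a basic metabolic panel (BMP) result when any analyte falls outside a predefined feasible range. The limits and function name here are hypothetical placeholders, not validated feasibility limits.

Show the Code
library(tidyverse)

## Hypothetical univariate feasibility limits (illustrative values only)
limits <- tribble(
  ~analyte,    ~lower, ~upper,
  "sodium",       100,    185,
  "potassium",    1.0,   12.0,
  "glucose",       10,   2000
)

## Flag a result as out-of-distribution if any analyte falls outside its range
flag_univariate_ood <- function(result, limits) {
  result |>
    pivot_longer(everything(), names_to = "analyte", values_to = "value") |>
    inner_join(limits, by = "analyte") |>
    mutate(ood = value < lower | value > upper) |>
    summarize(ood = any(ood)) |>
    pull(ood)
}

flag_univariate_ood(tibble(sodium = 83, potassium = 1.7, glucose = 1500), limits)
#> [1] TRUE

In practice, however, multivariate approaches are more common, as they can account for interactions between features that univariate limits cannot capture.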

Multivariate Methods

Mahalanobis Distance

The Mahalanobis distance measures how far a point lies from a distribution. It generalizes the familiar idea of counting how many standard deviations separate a point from a mean to multiple, possibly correlated, dimensions. We can use the Mahalanobis distance to identify out-of-distribution inputs in R using the code below.
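
For an input $x$, with mean vector $\mu$ and covariance matrix $\Sigma$ estimated from the training set, the distance takes the standard form

$$D_M(x) = \sqrt{(x - \mu)^\top \Sigma^{-1} (x - \mu)}$$

Note that base R's mahalanobis() returns the squared distance $D_M^2(x)$; since squaring preserves ranking, it works just as well for thresholding.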

Show the Code
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidymodels))

# Load Model and Data
## Load models
options(timeout=300)
model_realtime <- read_rds("https://figshare.com/ndownloader/files/45631488") |> 
  pluck("model") |> 
  bundle::unbundle()
recipe <- model_realtime |> extract_recipe()

## Load data
train <- arrow::read_feather("https://figshare.com/ndownloader/files/45407401")
validation <- arrow::read_feather("https://figshare.com/ndownloader/files/45407398")

## Preprocess Data
train_preprocessed <- bake(recipe, train)
validation_preprocessed <- bake(recipe, validation)

## Define Mahalanobis Distance from the Training Set
mahalanobis_distance <- function(data, train_preprocessed) {
  
  train_mean <- colMeans(train_preprocessed, na.rm = TRUE)
  train_cov <- cov(train_preprocessed, use = "pairwise.complete.obs")
  
  # mahalanobis() expects the covariance matrix itself unless inverted = TRUE
  # (which would require passing its inverse), so let the function invert it
  mahalanobis(data, train_mean, train_cov)
  
}

## Calculate Distances for the Training and Validation Sets
train_distances <- mahalanobis_distance(train_preprocessed, train_preprocessed)
validation_distances <- mahalanobis_distance(validation_preprocessed, train_preprocessed)

## Set the out-of-distribution cutoff at the 99.9th percentile of training distances
upper_bound <- quantile(train_distances, probs = 0.999, na.rm = TRUE)

## Plot Distance Distributions
gg_maha_dist_input <- bind_rows(
  tibble(label = "Train", distance = train_distances),
  tibble(label = "Validation", distance = validation_distances)
)

gg_maha_dist <- 
  ggplot(gg_maha_dist_input |> group_by(label) |> slice_sample(prop = 0.01) |> ungroup(), 
         aes(x = distance, color = label)) +
  stat_ecdf() +
  geom_vline(xintercept = upper_bound, linetype = "dashed") +
  annotate("text", x = upper_bound, y = 0.5, label = "Out-of-Distribution", angle = -90, hjust = 0, vjust = -0.5, fontface = "bold") +
  labs(title = "Mahalanobis Distance from Training Set",
       x = "Mahalanobis Distance",
       y = "Cumulative Proportion",
       color = NULL) + 
  scale_x_log10()

suppressWarnings(gg_maha_dist)
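
The cutoff can then be applied directly to flag individual results. A minimal sketch, using the objects defined above (the flag column name is illustrative):

Show the Code
## Flag validation inputs whose distance exceeds the training-derived cutoff
validation_flagged <- validation_preprocessed |>
  mutate(
    mahalanobis_distance = validation_distances,
    out_of_distribution = mahalanobis_distance > upper_bound
  )

## Proportion of validation inputs flagged out-of-distribution
## (by construction, roughly 0.1% of training inputs would exceed this bound)
mean(validation_flagged$out_of_distribution, na.rm = TRUE)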

Show the Code
library(gt)

validation_preprocessed_with_distances <- validation_preprocessed |> mutate(mahalanobis_distance = validation_distances)

validation |> 
  select(sodium, chloride, potassium_plas, co2_totl, bun, creatinine, calcium, glucose) |> 
  bind_cols(validation_preprocessed_with_distances |> select(matches("dist"))) |> 
  arrange(desc(mahalanobis_distance)) |> 
  slice_head(n = 1) |> 
  select(-matches("delta|dist|_id|dt_tm")) |> 
  pivot_longer(cols = everything(), names_to = "Analyte", values_to = "Result") |> 
  gt() |> 
  tab_header("Example Out-of-Distribution BMP")
Example Out-of-Distribution BMP
Analyte Result
sodium 83.00
chloride 60.00
potassium_plas 1.70
co2_totl 6.00
bun 2.00
creatinine 0.28
calcium 5.00
glucose 1500.00

Principal Component Analysis

Principal Component Analysis (PCA) is a technique that can be used to collapse correlated features into a simpler, lower-dimensional representation. With the help of the {applicable} package [1], we can use PCA to identify out-of-distribution inputs in R using the code below.
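
Conceptually, PCA centers each input and projects it onto the orthogonal directions of greatest variance learned from the training set,

$$z = W^\top (x - \mu),$$

where the columns of $W$ are the principal components. A sample's distance from the center of the training data in this reduced space, ranked against the distribution of training-set distances, then plays the same role as the Mahalanobis cutoff above.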

Show the Code
## Load package
library(applicable)

## Train PCA Model
train_pca <- apd_pca(train_preprocessed |> drop_na())

## Calculate Distance in PC space
pca_score <- score(train_pca, validation_preprocessed_with_distances)

## Add Distances to Validation Data
## (ntile() ranks distances into 1,000 bins; the plot below rescales these to a 0-100 percentile axis)
validation_preprocessed_with_distances <- validation_preprocessed_with_distances |> 
  mutate(
    pca_distance = pca_score$distance, 
    pca_pctl = pca_score$distance_pctl, 
    mahalanobis_pctl = ntile(mahalanobis_distance, n = 1000)
  )

## Plot PCA Distance as compared to Mahalanobis Distance
gg_pca_dist <-
  ggplot(validation_preprocessed_with_distances, aes(x = pca_pctl, y = mahalanobis_pctl)) +
    geom_point(alpha = 0.1, shape = ".") + 
    scale_x_continuous(name = "PCA Distance Percentile", breaks = c(0, 50, 100), labels = c(0, 50, 100)) + 
    scale_y_continuous(name = "Mahalanobis Percentile", breaks = c(0, 500, 1000), labels = c(0, 50, 100)) + 
    ggtitle("Comparison of Distance Metrics for Multivariate Applicability Assessment")

gg_pca_dist

While the PCA distance and the Mahalanobis distance are correlated, there are outliers along each axis that warrant further investigation. In general, PCA distance becomes more appropriate as input features grow in dimensionality or correlation, because the covariance matrix that the Mahalanobis distance must invert becomes ill-conditioned or singular in those settings.
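
A quick, self-contained demonstration of that failure mode (synthetic data, not from the BMP set): with perfectly collinear features, the covariance matrix cannot be inverted, so the Mahalanobis distance is undefined, while a PCA-based distance remains computable.

Show the Code
## Two perfectly collinear features produce a singular covariance matrix
set.seed(42)
x <- rnorm(100)
collinear <- tibble(a = x, b = 2 * x)

## Mahalanobis distance fails: the covariance matrix cannot be inverted
try(mahalanobis(collinear, colMeans(collinear), cov(collinear)))
#> Error in solve.default(cov, ...) : system is exactly singular

## PCA still produces a usable one-component representation
apd_pca(collinear)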

Key Takeaway:
Predictions made on out-of-distribution inputs should be interpreted with extreme caution.  Applicability assessment provides tools for identifying these cases.

References

1. Gotti M, Kuhn M. applicable: A compilation of applicability domain methods [Internet]. 2022. Available from: https://CRAN.R-project.org/package=applicable