An effective ML pipeline must be able to recognize when predictions should and should NOT be made. Applicability domain assessment, or applicability, refers to the identification of inputs that are too different from the training data for the pipeline to render a reliable prediction. These “out-of-distribution” inputs may arise from measurement error, labeling errors, or other pre-analytical factors.
In its simplest form, an applicability assessment can be a series of univariate decision boundaries that define the range of values for each feature that is considered safe for prediction, with everything outside those ranges labeled “out-of-distribution”. Feasibility limits and linearity limits are good examples within the clinical laboratory. However, multivariate approaches are more common in practice, as they can account for the interactions between features.
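The univariate case described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the analyte names, the toy values, and the choice of the observed training range as the limits are all hypothetical, and real feasibility or linearity limits would come from the assay's documentation.

```r
# Hypothetical analytes and toy values, for illustration only.
train_toy <- data.frame(
  sodium  = c(135, 138, 140, 142, 145),
  glucose = c(70, 85, 95, 110, 180)
)
new_data <- data.frame(
  sodium  = c(139, 190, 141),
  glucose = c(90, 100, 900)
)

# Per-feature limits: here simply the observed training range (an assumption).
# The result is a 2 x p matrix: row 1 holds minima, row 2 holds maxima.
limits <- sapply(train_toy, range)

# A row is in-domain only if every feature falls within its limits.
in_domain <- rep(TRUE, nrow(new_data))
for (feature in colnames(limits)) {
  x <- new_data[[feature]]
  in_domain <- in_domain & !is.na(x) &
    x >= limits[1, feature] & x <= limits[2, feature]
}

new_data$out_of_distribution <- !in_domain
print(new_data)
```

Each feature is checked independently here, so a combination of values that is individually plausible but jointly unusual would pass unflagged. That limitation is what motivates the multivariate methods.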
Multivariate Methods
Mahalanobis Distance
The Mahalanobis distance is a measure of the distance between a point and a distribution. It is a multivariate generalization of the idea of measuring how many standard deviations away a point is from the mean of a distribution. We can use the Mahalanobis distance to identify out-of-distribution inputs in R using the code below.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidymodels))

# Load Model and Data
## Load models
options(timeout = 300)
model_realtime <- read_rds("https://figshare.com/ndownloader/files/45631488") |>
  pluck("model") |>
  bundle::unbundle()
recipe <- model_realtime |> extract_recipe()

## Load data
train <- arrow::read_feather("https://figshare.com/ndownloader/files/45407401")
validation <- arrow::read_feather("https://figshare.com/ndownloader/files/45407398")

## Preprocess Data
train_preprocessed <- bake(recipe, train)
validation_preprocessed <- bake(recipe, validation)

## Define Mahalanobis Distance Relative to the Training Set
mahalanobis_distance <- function(data, train_preprocessed) {
  train_mean <- colMeans(train_preprocessed, na.rm = TRUE)
  train_cov <- cov(train_preprocessed, use = "pairwise.complete.obs")
  # train_cov is a covariance matrix, not its inverse, so `inverted` must be FALSE.
  # Note that stats::mahalanobis() returns squared distances.
  mahalanobis(data, train_mean, train_cov, inverted = FALSE)
}

## Calculate Mahalanobis Distance for Training and Validation Sets
train_distances <- mahalanobis_distance(train_preprocessed, train_preprocessed)
validation_distances <- mahalanobis_distance(validation_preprocessed, train_preprocessed)

## Flag anything beyond the 99.9th percentile of training distances
upper_bound <- quantile(train_distances, probs = 0.999, na.rm = TRUE)

## Plot Distances
gg_maha_dist_input <- bind_rows(
  tibble(label = "Train", distance = train_distances),
  tibble(label = "Validation", distance = validation_distances)
)

gg_maha_dist <- ggplot(
  gg_maha_dist_input |>
    dplyr::filter(label == "Train") |>
    slice_sample(prop = 0.01),
  aes(x = distance)
) +
  stat_ecdf() +
  geom_vline(xintercept = upper_bound, linetype = "dashed") +
  annotate(
    "text",
    x = upper_bound, y = 0.5, label = "Out-of-Distribution",
    angle = -90, hjust = 0, vjust = -0.5, fontface = "bold"
  ) +
  labs(
    title = "Mahalanobis Distance from Training Set",
    x = "Mahalanobis Distance",
    y = "Cumulative Proportion"
  ) +
  scale_x_log10()

suppressWarnings(gg_maha_dist)
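To make the distance itself more concrete, the sketch below computes it by hand on simulated data and compares the result with `stats::mahalanobis()`. The toy data, the object names, and the two query points are assumptions made for illustration; none of them come from the chapter's dataset.

```r
set.seed(1)

# Simulated training data: two strongly correlated features (toy data only).
n <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)
train_toy <- cbind(x1 = x1, x2 = x2)

mu <- colMeans(train_toy)
sigma <- cov(train_toy)

# Two query points at the same Euclidean distance from the origin:
# one follows the correlation between x1 and x2, the other runs against it.
query <- rbind(along = c(2, 2), against = c(2, -2))

# Squared distance by hand: (x - mu)' Sigma^{-1} (x - mu)
centered <- sweep(query, 2, mu)
d2_manual <- rowSums((centered %*% solve(sigma)) * centered)

# stats::mahalanobis() returns the same squared distance.
d2_builtin <- mahalanobis(query, center = mu, cov = sigma)

# With inverted = TRUE, `cov` must already be the inverse covariance matrix.
d2_inverted <- mahalanobis(query, center = mu, cov = solve(sigma), inverted = TRUE)

print(rbind(manual = d2_manual, builtin = d2_builtin, inverted = d2_inverted))
```

The point that runs against the correlation structure receives a much larger distance than the one that follows it, even though both are equally far from the origin in Euclidean terms. This is the interaction between features that univariate limits cannot capture.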
Principal Component Analysis (PCA) is a technique that can be used to collapse correlated features into a simpler representation. With the help of the {applicable}¹ package, we can use PCA to identify out-of-distribution inputs in R using the code below.
## Load package
library(applicable)

## Train PCA Model
train_pca <- apd_pca(train_preprocessed |> drop_na())

## Attach Mahalanobis Distances to the Validation Data
## (this step does not appear in the listing as shown; it is reconstructed
## here, as an assumption, from the objects defined earlier)
validation_preprocessed_with_distances <- validation_preprocessed |>
  mutate(mahalanobis_distance = validation_distances)

## Calculate Distance in PC space
pca_score <- score(train_pca, validation_preprocessed_with_distances)

## Add Distance to Validation Data
validation_preprocessed_with_distances <- validation_preprocessed_with_distances |>
  mutate(
    pca_distance = pca_score$distance,
    pca_pctl = pca_score$distance_pctl,
    mahalanobis_pctl = ntile(mahalanobis_distance, n = 1000)
  )

## Plot PCA Distance as compared to Mahalanobis Distance
gg_pca_dist <- ggplot(
  validation_preprocessed_with_distances,
  aes(x = pca_pctl, y = mahalanobis_pctl)
) +
  geom_point(alpha = 0.1, shape = ".") +
  scale_x_continuous(
    name = "PCA Distance Percentile",
    breaks = c(0, 50, 100), labels = c(0, 50, 100)
  ) +
  scale_y_continuous(
    name = "Mahalanobis Percentile",
    breaks = c(0, 500, 1000), labels = c(0, 50, 100)
  ) +
  ggtitle("Comparison of Distance Metrics for Multivariate Applicability Assessment")

gg_pca_dist
While the PCA distance and the Mahalanobis distance are correlated, there are outliers along each axis that warrant further investigation. In general, PCA distance is more appropriate as the input features become higher-dimensional or more strongly correlated.
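The general idea of a PCA-based distance can be sketched in base R with `prcomp()`. This is a simplified stand-in, not the {applicable} package's implementation: the simulated data, the 95% variance cutoff, the 99.9th percentile threshold, and the helper `pc_distance()` are all assumptions chosen for illustration.

```r
set.seed(42)

# Simulated training data: 10 features driven by 2 latent factors,
# so the features are strongly correlated (toy data only).
n <- 1000
latent <- matrix(rnorm(n * 2), ncol = 2)
loadings <- rbind(rep(1, 10), c(rep(1, 5), rep(-1, 5)))
train_toy <- latent %*% loadings + matrix(rnorm(n * 10, sd = 0.1), ncol = 10)
colnames(train_toy) <- paste0("x", 1:10)

# Fit PCA on the training data and keep enough components
# to explain at least 95% of the variance.
pca <- prcomp(train_toy, center = TRUE, scale. = TRUE)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(var_explained >= 0.95)[1]

# Hypothetical helper: Euclidean distance from the training centroid
# in the reduced PC space (training scores are centered at zero).
pc_distance <- function(new_data) {
  scores <- predict(pca, newdata = new_data)[, seq_len(k), drop = FALSE]
  sqrt(rowSums(scores^2))
}

train_dist <- pc_distance(train_toy)
threshold <- quantile(train_dist, probs = 0.999)

# One training point near the centroid and one far outside the training range.
new_points <- rbind(
  typical = train_toy[which.min(train_dist), ],
  extreme = rep(10, 10)
)
flagged <- pc_distance(new_points) > threshold
print(flagged)
```

Because the distance is measured in a handful of uncorrelated components, it does not require inverting the full covariance matrix of the features, which becomes close to singular when many features are nearly redundant.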
Key Takeaway:
Predictions made on out-of-distribution inputs should be interpreted with extreme caution. Applicability assessment provides tools for identifying these cases.