Biomedicine

Transportability Metabolomics Microbiome ML for Science

Classifiers trained on one cohort routinely degrade when applied to another. Standard approaches to quantifying this risk are either post-hoc (requiring labeled target data) or confounded by source model quality. We develop Grassmannian geodesic distance as a geometry-based surrogate that predicts transportability failures before deployment.

Grassmannian Geodesic Distance Predicts Cross-Cohort Classifier Degradation

Grassmannian Geodesic Distance Predicts Cross-Cohort Classifier Degradation After Controlling for Source Classifier Quality

Methodology

Geodesic distance between cohort-specific PCA subspaces on the Grassmannian Gr(k, d) predicts the residual transportability gap on a colorectal cancer (CRC) microbiome meta-analysis (9 studies, 824 samples; partial rho = +0.61, clustered bootstrap 95% CI [-0.02, +0.80]) and on a multi-biofluid metabolomics study (3 biofluids × 3 ethnicities, 356 participants; partial rho = +0.40, clustered CI [+0.04, +0.73]). Both estimates are computed after controlling for source classifier quality, which the raw AUC gap conflates with distribution shift. Leave-one-study-out prospective validation on the CRC data confirms out-of-sample utility: the geodesic-augmented model reduces prediction MAE by 42% over baseline and 22% over a source-quality-only model.
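For readers who want a concrete picture of the quantity involved, the following is a minimal sketch of one common way to compute a geodesic distance between two PCA subspaces. It assumes the arc-length metric on Gr(k, d) (the 2-norm of the principal angles) and an SVD-based PCA. The summary above does not state which variant, which k, or which preprocessing the study used, so the function names `pca_basis` and `grassmann_geodesic`, the choice k = 3, and the synthetic cohorts are all hypothetical and for illustration only.

```python
import numpy as np


def pca_basis(X, k):
    """Return a (d x k) orthonormal basis of the top-k principal subspace of X (n x d)."""
    # Center each feature; PCA directions are the right singular vectors of the centered data.
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T


def grassmann_geodesic(U, V):
    """Geodesic (arc-length) distance between the subspaces spanned by orthonormal U and V."""
    # Singular values of U^T V are the cosines of the principal angles.
    cosines = np.linalg.svd(U.T @ V, compute_uv=False)
    # Clip to guard against round-off pushing values slightly outside [-1, 1].
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return float(np.linalg.norm(angles))


# Illustrative usage on synthetic data (not the study's data).
rng = np.random.default_rng(0)
cohort_a = rng.normal(size=(60, 10))
cohort_b = rng.normal(size=(80, 10)) @ rng.normal(size=(10, 10))  # different covariance

k = 3  # assumed subspace dimension, chosen only for the example
dist = grassmann_geodesic(pca_basis(cohort_a, k), pca_basis(cohort_b, k))
print(f"geodesic distance between cohort subspaces: {dist:.3f}")
```

Under this metric the distance ranges from 0 (identical subspaces) to sqrt(k) * pi / 2 (mutually orthogonal subspaces), so values are only comparable across cohort pairs when k is held fixed.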

Five additional datasets yield null results, revealing interpretable boundary conditions: the method requires distributional shift driven by analytical heterogeneity (different platforms, protocols, or biological matrices). Datasets generated on centralized platforms or on shared microarray platforms produce null results even across independent study sites.

