
Transportability · Metabolomics · Microbiome · ML for Science

Classifiers trained on one cohort routinely degrade when applied to another. Standard approaches to quantifying this risk are either post-hoc (requiring labeled target data) or confounded by source model quality. We develop Grassmannian geodesic distance as a geometry-based surrogate that predicts transportability failures before deployment.



Grassmannian Geodesic Distance Predicts Cross-Cohort Classifier Degradation After Controlling for Source Classifier Quality

Methodology

Geodesic distance between cohort-specific PCA subspaces on the Grassmannian Gr(k, d) predicts the residual transportability gap in two settings: a colorectal cancer (CRC) microbiome meta-analysis (9 studies, 824 samples; partial rho = +0.61, clustered bootstrap 95% CI [-0.02, +0.80]) and a multi-biofluid metabolomics study (3 biofluids × 3 ethnicities, 356 participants; partial rho = +0.40, clustered CI [+0.04, +0.73]). Both estimates control for source classifier quality, because the raw AUC gap conflates source classifier quality with distribution shift. Leave-one-study-out prospective validation on the CRC data confirms out-of-sample utility: the geodesic-augmented model reduces prediction MAE by 42% relative to the baseline and by 22% relative to a source-quality-only model.
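The distance itself can be sketched in a few lines. The snippet below is a minimal illustration rather than the study's pipeline: it assumes each cohort is a samples-by-features matrix, that the subspace is spanned by the top-k principal directions of the mean-centered data, and that "geodesic distance" means the arc-length metric (the Euclidean norm of the principal angles). The function names, the choice of k, and the synthetic cohorts are all hypothetical.

```python
import numpy as np


def pca_basis(X, k):
    """Return a (d, k) orthonormal basis for the top-k PCA subspace of X (n, d)."""
    # Center features so the SVD yields principal directions (assumed preprocessing).
    Xc = X - X.mean(axis=0, keepdims=True)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T


def grassmann_geodesic_distance(U, V):
    """Arc-length geodesic distance on Gr(k, d) between span(U) and span(V).

    U and V are (d, k) matrices with orthonormal columns. The principal
    angles between the two subspaces are the arccosines of the singular
    values of U^T V; the distance is the Euclidean norm of those angles.
    """
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    # Clip to [-1, 1] to guard against round-off before taking arccos.
    theta = np.arccos(np.clip(s, -1.0, 1.0))
    return float(np.linalg.norm(theta))


# Usage on synthetic data (hypothetical cohorts, not the CRC or metabolomics data).
rng = np.random.default_rng(0)
cohort_a = rng.normal(size=(100, 20))
cohort_b = rng.normal(size=(120, 20)) @ rng.normal(size=(20, 20))

k = 5  # subspace dimension; an assumed value, not taken from the study
dist = grassmann_geodesic_distance(pca_basis(cohort_a, k), pca_basis(cohort_b, k))
print(f"geodesic distance between cohort subspaces: {dist:.3f}")
```

The distance depends only on the subspaces, not on the particular bases chosen for them, and it is bounded above by (π/2)·√k, which makes values comparable across cohort pairs for a fixed k.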

Five additional datasets yield null results, revealing interpretable boundary conditions: the method requires distributional shift driven by analytical heterogeneity (different platforms, protocols, or biological matrices). Datasets generated on a centralized platform or a shared microarray platform produce null results even when the samples come from independent study sites.
