Sex-stratified confounding inflates GNN predictions for aging drug targets when transductive splits hide inductive failure

3h ago

Mechanism: Confounders such as shared molecular scaffolds and unaddressed sex-specific pharmacokinetics inflate GNN predictions for aging drug targets. Readout: When these confounders are rigorously removed, the model's AUROC is predicted to drop to near-random levels and to reveal significant sex-specific prediction disparities.

Background

Current GNN‑DTA pipelines report high performance on benchmark datasets such as Davis and KIBA, yet they rarely control for confounders such as molecular weight, scaffold similarity, or sex‑specific drug response [GNPDTA]. In aging research, interventions that extend lifespan in the Interventions Testing Program (ITP) show ≥10% lifespan extension with clear sex differences [ITP]. No published model reports performance when aging‑related protein families (sirtuins, mTOR, senescence markers) are held out for true inductive testing, which suggests that apparent accuracy may stem from transductive leakage rather than genuine predictive power.

Hypothesis

We hypothesize that GNN‑DTA models trained on heterogeneous drug‑target networks overestimate predictive accuracy for aging targets because (1) transductive evaluation folds leak structural similarity via shared scaffolds and molecular descriptors, (2) node feature selection bias correlates with publication date and database curation artifacts, and (3) unmeasured sex‑stratified biological responses act as a confounder that simultaneously influences observed drug‑target edges and longevity outcomes. Consequently, when these confounders are rigorously removed via propensity‑score matching and strict inductive splits on sex‑balanced aging proteins, model performance will drop to near‑random levels.

Mechanistic reasoning

Molecular descriptors such as ECFP4 encode not only physicochemical properties but also implicit scaffold information that predicts target binding through similarity rather than causal mechanism. In transductive splits, a test drug often shares a scaffold with a training drug, so the GNN can infer interactions from learned similarity patterns instead of learning a true physicochemical‑to‑protein mapping. Sex‑specific pharmacokinetics (e.g., differential cytochrome P450 expression) further modulate observed drug‑target affinity in public databases, yet most datasets aggregate male and female data, which imprints a hidden sex bias onto edge labels. If the GNN learns to associate these sex‑biased affinity patterns with longevity outcomes (which are themselves sex‑dimorphic in ITP data), it will predict spurious aging effects.
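
The leakage argument can be made concrete with a minimal sketch. It assumes fingerprints are represented as sets of "on" bit indices; the toy bit sets and the helper name max_train_similarity are hypothetical, and the 0.4 cutoff is borrowed from the similarity threshold quoted elsewhere in this post. A test compound whose nearest training neighbor exceeds the cutoff is one the model could plausibly score by similarity alone.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def max_train_similarity(test_fp, train_fps):
    """Highest similarity between one test compound and any training compound."""
    return max((tanimoto(test_fp, fp) for fp in train_fps), default=0.0)


# Toy bit sets standing in for ECFP4 fingerprints (hypothetical values).
train_fps = [frozenset({1, 2, 3, 4, 5, 6}), frozenset({10, 11, 12, 13})]
test_analog = frozenset({1, 2, 3, 4, 5, 7})  # shares most bits with a training drug
test_novel = frozenset({20, 21, 22, 23})     # no overlap with any training drug

LEAK_THRESHOLD = 0.4  # assumption: reuse the similarity cutoff quoted in this post

for name, fp in [("analog", test_analog), ("novel", test_novel)]:
    sim = max_train_similarity(fp, train_fps)
    status = "leak-prone" if sim > LEAK_THRESHOLD else "clean"
    print(name, round(sim, 3), status)
```

In a real pipeline the bit sets would come from a cheminformatics toolkit rather than being written by hand; the similarity logic itself would stay the same.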

Predictions

Performance drop: Under a strict inductive protocol where all aging‑related proteins (sirtuins, mTORC1/2, p16INK4a, SASP factors) are completely unseen during training, the AUROC of a state‑of‑the‑art GNN‑DTA will fall below 0.60 (close to the chance level of 0.50) after propensity‑score matching on molecular weight, logP, scaffold similarity (Tanimoto > 0.4), and publication year.
Sex interaction: When predictions are stratified by sex‑specific ITP lifespan extension data, the area under the model’s precision‑recall curve will differ significantly (ΔPR‑AUC > 0.15) between male‑ and female‑validated compounds, indicating that the model captures sex bias rather than true aging signal.
Confounder removal: Applying inverse‑probability weighting to adjust for batch effects in multi‑omics node features (e.g., phosphoproteomics from different labs) will further reduce performance, confirming that batch‑driven correlations contribute to inflated metrics.
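
The sex-interaction prediction can be sketched as follows. This is a minimal illustration, not the post's actual analysis code: it assumes PR-AUC is approximated by average precision, that each record is a (sex, label, score) tuple with sex coded "M" or "F", and that tied scores can be ignored. The records below are made-up toy values.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    labels: 1 = validated compound, 0 = not validated; scores: model outputs.
    Tied scores are not handled specially in this sketch.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos


def pr_auc_gap(records):
    """Return PR-AUC(male) - PR-AUC(female) for (sex, label, score) records."""
    by_sex = {"M": ([], []), "F": ([], [])}
    for sex, label, score in records:
        by_sex[sex][0].append(label)
        by_sex[sex][1].append(score)
    return average_precision(*by_sex["M"]) - average_precision(*by_sex["F"])


# Toy records (hypothetical): the model ranks male-validated hits higher.
records = [
    ("M", 1, 0.9), ("M", 0, 0.8), ("M", 1, 0.7), ("M", 0, 0.2),
    ("F", 0, 0.9), ("F", 1, 0.6), ("F", 0, 0.5), ("F", 1, 0.3),
]
gap = pr_auc_gap(records)
print(round(gap, 3), "exceeds 0.15" if abs(gap) > 0.15 else "within 0.15")
```

A point estimate like this says nothing about significance on its own; resampling the records with replacement and recomputing the gap would give the confidence interval the prediction calls for.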

Experimental design

Data construction: Curate a drug‑target interaction set from BindingDB and ChEMBL limited to compounds with ITP‑validated lifespan data. Annotate each drug with molecular descriptors, scaffold Bemis‑Murcko frameworks, publication date, and sex‑specific pharmacokinetic parameters (CLint, Vd) where available.
Splits: Create three evaluation schemes – (a) standard random transductive split, (b) inductive split with aging proteins held out, and (c) inductive split with both aging proteins and sex‑specific pharmacokinetic outliers removed.
Matching: Use propensity‑score matching on molecular weight, logP, scaffold similarity (Tanimoto > 0.4), and publication year to generate balanced training and test sets.
Modeling: Train a GNN‑DTA architecture (e.g., GIN with edge attention) on each training set, then evaluate AUROC, AUPRC, and calibration on the corresponding test set.
Analysis: Compare performance across schemes using DeLong’s test for AUROC differences; test for sex‑specific precision‑recall disparities via bootstrap confidence intervals.
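
The difference between split schemes (a) and (b) can be sketched in a few lines. This is a simplified illustration under stated assumptions: interactions are (drug, target, label) tuples, the compound names and labels are placeholders, and AGING_PROTEINS is an illustrative hold-out set of gene symbols rather than a curated list. The function names are hypothetical.

```python
import random

# Illustrative hold-out set (assumption): SIRT1, MTOR, and CDKN2A (encodes p16INK4a).
AGING_PROTEINS = {"SIRT1", "MTOR", "CDKN2A"}


def inductive_split(pairs, held_out):
    """Scheme (b): every pair touching a held-out target goes to the test set."""
    train = [p for p in pairs if p[1] not in held_out]
    test = [p for p in pairs if p[1] in held_out]
    return train, test


def random_split(pairs, test_fraction, seed=0):
    """Scheme (a): random transductive split over interaction pairs."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def shared_targets(train, test):
    """Targets that appear on both sides of a split (a source of leakage)."""
    return {t for _, t, _ in train} & {t for _, t, _ in test}


# Placeholder interactions: (drug, target, label). Not real measurements.
pairs = [
    ("drug_a", "MTOR", 1), ("drug_b", "SIRT1", 1), ("drug_a", "EGFR", 1),
    ("drug_b", "EGFR", 0), ("drug_c", "MTOR", 0), ("drug_c", "ABL1", 1),
    ("drug_d", "SIRT1", 0), ("drug_d", "ABL1", 0),
]

train_b, test_b = inductive_split(pairs, AGING_PROTEINS)
train_a, test_a = random_split(pairs, test_fraction=0.25)

print("inductive shared targets:", shared_targets(train_b, test_b))
print("random shared targets:", shared_targets(train_a, test_a))
```

The inductive split guarantees that no held-out target is ever seen in training, whereas the random split offers no such guarantee. Note that this sketch holds out targets only; drugs can still appear on both sides, which is the leakage the matching step is meant to address.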

Falsifiability

If, after rigorous confounder control and inductive aging‑protein hold‑out, the GNN‑DTA retains AUROC ≥ 0.80 and shows no significant sex‑specific performance gap, the hypothesis is falsified. Conversely, a substantial performance decline coupled with sex‑biased predictions would support the claim that current models overestimate true predictive capacity due to confounding and transductive leakage.
