Hypothesis: AI Foundation Models Trained on Protein-Ligand Co-evolution Data Will Outperform Structure-Based Drug Design by 10x on Hit Rate

Mechanism: A foundation model trained on protein-ligand co-evolution data learns dynamic binding interactions that static structure-based drug design misses. Readout: A predicted hit rate above 20% in prospective virtual screens, roughly a 10x improvement over the upper end of current methods.

The Current Failure Mode

Structure-based drug design (SBDD) starts from a static protein crystal structure, docks millions of compounds, and optimizes binding affinity. Hit rates in virtual screens: 0.1-2%. Billions spent. Most predicted binders don't bind. The problem isn't compute — it's the representation.

The Hypothesis

A foundation model pre-trained on evolutionary co-variation between protein families and their endogenous ligands (protein-ligand co-evolution) will achieve >20% hit rates in prospective virtual screens, a 10x improvement over the ~2% upper end of current SBDD hit rates, because it learns the dynamic fitness landscape of binding, not just static geometry.
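
The size of the claimed improvement depends on which end of the SBDD range is used as the baseline. A minimal sketch of that arithmetic, using only the figures quoted in this post (the function and constant names are illustrative, not from any library):

```python
def fold_improvement(target_rate: float, baseline_rate: float) -> float:
    """Ratio of a target hit rate to a baseline hit rate (both as fractions)."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return target_rate / baseline_rate


TARGET = 0.20          # hypothesized hit rate (20%)
BASELINE_LOW = 0.001   # 0.1%, low end of the SBDD range quoted in this post
BASELINE_HIGH = 0.02   # 2%, high end of the SBDD range quoted in this post

fold_vs_high = fold_improvement(TARGET, BASELINE_HIGH)
fold_vs_low = fold_improvement(TARGET, BASELINE_LOW)

print(f"{fold_vs_high:.0f}x vs a 2% baseline")
print(f"{fold_vs_low:.0f}x vs a 0.1% baseline")
```

So the 10x figure corresponds to the best-case 2% baseline; against a 0.1% baseline the same 20% hit rate would be a 200x improvement.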

Mechanism

Proteins and their ligands co-evolve. Receptor binding sites carry evolutionary signatures of the chemical space they interact with.
Multiple sequence alignments (MSAs) of receptor families encode implicit information about ligand selectivity
A transformer trained on paired {protein MSA, ligand SMILES} data learns a joint embedding where evolutionary conservation maps to binding competence
Key insight: This captures induced fit, allosteric effects, and water-mediated interactions that static docking misses — because evolution already "tested" billions of variants
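
To make "evolutionary conservation" concrete, here is a small sketch of one common way to score it: per-column Shannon entropy over an MSA. This is not the proposed transformer, only the kind of signal it would have to learn from. The toy alignment is made up for illustration, and the function name is hypothetical.

```python
import math
from collections import Counter


def column_conservation(msa: list[str]) -> list[float]:
    """Per-column conservation in [0, 1], computed as 1 - normalized Shannon entropy.

    Gap characters ('-') are ignored; a column containing only gaps scores 0.0.
    """
    if not msa:
        return []
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = math.log2(20)  # normalizer: 20 standard amino acids
    scores = []
    for col in range(length):
        residues = [seq[col] for seq in msa if seq[col] != "-"]
        if not residues:
            scores.append(0.0)
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


# Toy alignment (assumption: invented sequences, not real receptor data).
toy_msa = ["MKDLA", "MKELA", "MRDLG", "MK-LA"]
scores = column_conservation(toy_msa)
for position, score in enumerate(scores):
    print(position, round(score, 3))
```

Fully conserved columns score 1.0 and variable columns score lower. In the hypothesis above, the model would learn which patterns of conservation and variation track ligand chemistry, rather than using a fixed score like this one.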

Evidence Basis

AlphaFold2 showed that MSA-based transformers can predict protein structures at near-experimental accuracy, which physics-based simulation alone had not achieved
Co-evolutionary analysis (DCA, EVcouplings) already predicts protein-protein interactions
RFdiffusion + ProteinMPNN achieve ~30% experimental success on de novo protein design, suggesting that learned representations can beat physics-based approaches
Recursive pharma companies using evolutionary data reportedly see 3-5x improved hit rates (private data, not independently verifiable)

Proposed Test

Curate training set: 500K+ protein family MSAs paired with known endogenous ligands and substrates from ChEMBL/BindingDB
Train protein-ligand co-evolution transformer (PLCoEvo) — predict ligand from MSA and vice versa
Prospective validation: pick 5 targets whose known actives are held out of training
Screen 10K compounds per target, synthesize top 100 predictions
Primary endpoint: Experimental hit rate (IC50 < 10 μM) vs matched AlphaFold+Glide docking baseline
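
With only 100 synthesized compounds per target, the hit-rate comparison needs an uncertainty estimate. A minimal sketch using the Wilson score interval for a binomial proportion follows; the hit counts are hypothetical placeholders chosen to match the 20% and 2% figures in this post, not experimental results.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= hits <= n:
        raise ValueError("hits must be between 0 and n")
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


N_TESTED = 100      # compounds synthesized and assayed per target
MODEL_HITS = 20     # HYPOTHETICAL: compounds with IC50 < 10 uM from the model
BASELINE_HITS = 2   # HYPOTHETICAL: same threshold, docking baseline

model_low, model_high = wilson_interval(MODEL_HITS, N_TESTED)
base_low, base_high = wilson_interval(BASELINE_HITS, N_TESTED)

print(f"model:    {MODEL_HITS / N_TESTED:.0%} (95% CI {model_low:.1%} to {model_high:.1%})")
print(f"baseline: {BASELINE_HITS / N_TESTED:.0%} (95% CI {base_low:.1%} to {base_high:.1%})")
print("intervals overlap:", base_high >= model_low)
```

Under these placeholder counts the two intervals do not overlap for a single target, but a smaller true effect (for example 3-5x) would be much harder to resolve at n = 100, which is one reason to pool results across all 5 targets.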

Implications

This would fundamentally shift drug discovery from physics simulation to evolutionary intelligence. Nature ran the largest drug screen in history over 4 billion years. We just need to read the results. DeSci can accelerate this: open protein-ligand co-evolution datasets, tokenized model access, community validation. The best drug designer isn't a chemist or an algorithm. It's evolution. We just need to learn its language.
