Mechanism: A foundation model trained on protein-ligand co-evolution data learns dynamic binding interactions, surpassing static structure-based drug design. Readout: This approach is hypothesized to achieve a >20% hit rate in prospective virtual screens, a 10x improvement over current methods.
The Current Failure Mode
Structure-based drug design (SBDD) starts from a static protein crystal structure, docks millions of compounds, and optimizes binding affinity. Hit rates in virtual screens: 0.1-2%. Billions spent. Most predicted binders don't bind. The problem isn't compute — it's the representation.
The Hypothesis
A foundation model pre-trained on evolutionary co-variation between protein families and their endogenous ligands (protein-ligand co-evolution) will achieve >20% hit rates in prospective virtual screens, a 10x improvement over the upper end (2%) of current SBDD hit rates, because it learns the dynamic fitness landscape of binding, not just static geometry.
Mechanism
- Proteins and their ligands co-evolve. Receptor binding sites carry evolutionary signatures of the chemical space they interact with
- Multiple sequence alignments (MSAs) of receptor families encode implicit information about ligand selectivity
- A transformer trained on paired {protein MSA, ligand SMILES} data learns a joint embedding where evolutionary conservation maps to binding competence
- Key insight: This captures induced fit, allosteric effects, and water-mediated interactions that static docking misses — because evolution already "tested" billions of variants
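The "evolutionary signature" in the bullets above can be made concrete with per-column conservation of an MSA. The following is a minimal sketch, assuming a toy alignment and Shannon entropy as the conservation measure; `column_entropy`, `conservation_profile`, and `toy_msa` are illustrative names, not part of any existing PLCoEvo implementation, and a real model would consume far richer features than this.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, ignoring gap characters."""
    residues = [c for c in column if c != "-"]
    if not residues:
        # Assumption: an all-gap column is treated as zero entropy.
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def conservation_profile(msa):
    """Per-column conservation in [0, 1]; 1.0 means a fully conserved column."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all sequences must have the same aligned length")
    max_entropy = math.log2(20)  # 20 standard amino acids
    return [
        1.0 - column_entropy([seq[i] for seq in msa]) / max_entropy
        for i in range(length)
    ]

# Toy alignment (hypothetical): columns 0 and 1 are invariant, 2 and 3 vary.
toy_msa = ["ACDK", "ACEK", "ACDR", "AC-K"]
profile = conservation_profile(toy_msa)
for position, score in enumerate(profile):
    print(position, round(score, 3))
```

In this picture, highly conserved binding-site columns would be the positions the model is expected to associate with ligand chemistry; whether that association actually holds is what the proposed test is meant to establish.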
Evidence Basis
- AlphaFold2 showed that MSA-based transformers can predict protein structure at near-experimental accuracy for many targets, something physics-based simulation had not achieved at scale
- Co-evolutionary analysis (DCA, EVcouplings) already predicts protein-protein interactions
- RFdiffusion + ProteinMPNN achieve ~30% experimental success on de novo protein design, suggesting learned representations can outperform purely physics-based design
- Recursive pharma companies using evolutionary data report 3-5x improved hit rates (private data)
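The co-evolutionary analysis mentioned above rests on detecting columns of an alignment that vary together. A minimal sketch of that idea uses mutual information between two columns; this is a much simpler statistic than DCA or EVcouplings (which correct for indirect couplings and phylogenetic bias), and the function name and toy data are assumptions for illustration only.

```python
import math
from collections import Counter

def mutual_information(msa, i, j):
    """Mutual information (bits) between alignment columns i and j.

    Sequences with a gap in either column are skipped.
    """
    pairs = [(s[i], s[j]) for s in msa if s[i] != "-" and s[j] != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    freq_i = Counter(a for a, _ in pairs)
    freq_j = Counter(b for _, b in pairs)
    freq_ij = Counter(pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((freq_i[a] / n) * (freq_j[b] / n)))
        for (a, b), c in freq_ij.items()
    )

# Toy alignment (hypothetical): columns 0 and 1 co-vary perfectly,
# column 2 is invariant and therefore carries no covariation signal.
covarying_msa = ["AKG", "AKG", "DEG", "DEG"]
mi_coupled = mutual_information(covarying_msa, 0, 1)
mi_uncoupled = mutual_information(covarying_msa, 0, 2)
print(mi_coupled, mi_uncoupled)
```

The hypothesis extends this kind of signal from residue-residue pairs to protein-ligand pairs, which is the part that remains untested.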
Proposed Test
- Curate training set: 500K+ protein family MSAs paired with known endogenous ligands and substrates from ChEMBL/BindingDB
- Train protein-ligand co-evolution transformer (PLCoEvo) — predict ligand from MSA and vice versa
- Prospective validation: pick 5 targets with known actives held out of training
- Screen 10K compounds per target, synthesize top 100 predictions
- Primary endpoint: Experimental hit rate (IC50 < 10 μM) vs matched AlphaFold+Glide docking baseline
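The primary endpoint can be evaluated with simple arithmetic plus a significance test. The sketch below assumes 100 synthesized compounds per arm for a single target and uses a one-sided Fisher exact test; the hit counts are placeholders taken from the hypothesis's target figures (20% vs 2%), not experimental results, and the choice of test is an assumption rather than part of the stated protocol.

```python
from math import comb

def hit_rate(hits, tested):
    """Fraction of tested compounds that met the hit criterion (IC50 < 10 uM)."""
    if tested <= 0:
        raise ValueError("tested must be positive")
    return hits / tested

def fisher_one_sided(hits_a, n_a, hits_b, n_b):
    """One-sided Fisher exact p-value.

    Probability that arm A shows at least hits_a hits if both arms share
    the same underlying hit rate, conditioning on the total number of hits
    (upper tail of the hypergeometric distribution).
    """
    total_hits = hits_a + hits_b
    denominator = comb(n_a + n_b, total_hits)
    upper = min(n_a, total_hits)
    numerator = sum(
        comb(n_a, k) * comb(n_b, total_hits - k)
        for k in range(hits_a, upper + 1)
    )
    return numerator / denominator

# Placeholder counts (hypothetical, NOT measured data).
model_hits, model_tested = 20, 100        # PLCoEvo arm
baseline_hits, baseline_tested = 2, 100   # AlphaFold+Glide docking arm

fold_improvement = hit_rate(model_hits, model_tested) / hit_rate(
    baseline_hits, baseline_tested
)
p_value = fisher_one_sided(model_hits, model_tested, baseline_hits, baseline_tested)
print(f"fold improvement: {fold_improvement:.1f}, one-sided p: {p_value:.2e}")
```

With five targets, the same comparison would be run per target or on pooled counts; pooling assumes the targets are comparable, which may not hold.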
Implications
This would fundamentally shift drug discovery from physics simulation to evolutionary intelligence. Nature ran the largest drug screen in history over 4 billion years. We just need to read the results. DeSci can accelerate this: open protein-ligand co-evolution datasets, tokenized model access, community validation. The best drug designer isn't a chemist or an algorithm. It's evolution. We just need to learn its language.