Protein Language Models Will Discover Functional Proteins With No Natural Homologs — The Dark Proteome of Possible Biology
This infographic illustrates how natural evolution has explored only a small fraction of protein sequence space, the 'Known Proteome.' In contrast, protein language models may be able to design novel proteins with unprecedented functions, opening up the vast 'Dark Proteome' of possible biology, potentially with high catalytic efficiency and unique mechanisms.
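Natural proteins occupy a tiny fraction of possible sequence space: a 100-residue protein alone has 20^100 (about 10^130) possible sequences. Evolution is constrained by historical contingency: it moves in small steps (point mutations, insertions, deletions, recombination) from sequences that already exist, and each intermediate generally has to remain viable. A generative model does not have this constraint.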
Protein language models (ESM-2, ProGen) trained on natural proteins learn the grammar of protein sequences: which residues tend to follow which, and which patterns produce stable folds. But they can also generate sequences that satisfy these learned rules while having little or no detectable similarity to any natural protein.
Hypothesis: The space of functional proteins is vastly larger than the space explored by evolution. Protein language models will design functional proteins with <20% sequence identity to any natural protein, accessing a 'dark proteome' of possible biology that evolution never reached. These de novo proteins will include novel enzymatic activities, binding specificities, and structural motifs with no evolutionary precedent.
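To make the "<20% sequence identity" criterion concrete, here is a minimal sketch of how it could be checked. It assumes a simple global alignment (Needleman-Wunsch with unit scores) and defines identity as matches divided by alignment length; both are illustrative choices, not the post's stated method, and a real screen would use a dedicated search tool against a full sequence database with a proper substitution matrix. The function names and the toy sequences are hypothetical.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with simple unit scores (assumed)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Identical aligned positions as a percentage of alignment length."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    al_a, al_b = global_align(a, b)
    matches = sum(x == y for x, y in zip(al_a, al_b) if x != "-")
    return 100.0 * matches / len(al_a)


def max_identity(candidate, natural_seqs):
    """Highest identity between a candidate and any sequence in a reference set."""
    return max(percent_identity(candidate, s) for s in natural_seqs)


def is_dark(candidate, natural_seqs, threshold=20.0):
    """True if the candidate is below the identity threshold to every reference."""
    return max_identity(candidate, natural_seqs) < threshold


if __name__ == "__main__":
    naturals = ["MKTAYIAKQR", "MKVLAAGIVG"]  # toy stand-ins, not real proteins
    design = "WPHHCDEWNF"                    # hypothetical designed sequence
    print(max_identity(design, naturals), is_dark(design, naturals))
```

Note that the denominator matters: identity over alignment length, over the shorter sequence, and over aligned (non-gap) columns can give noticeably different numbers near a 20% cutoff, so any test of the hypothesis would need to fix the convention in advance.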
Prediction: By 2028, a protein language model will design a functional enzyme with a novel catalytic mechanism (not found in any enzyme database) that achieves kcat/Km > 10^4 M^-1s^-1 for a reaction with no known biological catalyst.
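The efficiency threshold in the prediction is simple arithmetic on two measured kinetic constants. As a sketch, assuming kcat is reported in s^-1 and Km in mol/L (the function names and example values below are hypothetical, not measurements):

```python
THRESHOLD = 1e4  # M^-1 s^-1, the cutoff stated in the prediction


def catalytic_efficiency(kcat_per_s, km_molar):
    """Return kcat/Km in M^-1 s^-1 from kcat (s^-1) and Km (mol/L)."""
    if km_molar <= 0:
        raise ValueError("Km must be positive")
    return kcat_per_s / km_molar


def meets_prediction(kcat_per_s, km_molar, threshold=THRESHOLD):
    """True if kcat/Km strictly exceeds the threshold."""
    return catalytic_efficiency(kcat_per_s, km_molar) > threshold


if __name__ == "__main__":
    # Illustrative values: kcat = 2 s^-1, Km = 100 micromolar = 1e-4 M.
    print(catalytic_efficiency(2.0, 1e-4), meets_prediction(2.0, 1e-4))
    # A slower, weaker-binding enzyme: kcat = 0.5 s^-1, Km = 1 mM.
    print(catalytic_efficiency(0.5, 1e-3), meets_prediction(0.5, 1e-3))
```

The first example comes out at roughly 2 x 10^4 M^-1 s^-1 and clears the bar; the second, at roughly 5 x 10^2 M^-1 s^-1, does not. Unit conversion is the usual pitfall, since Km is often reported in micromolar or millimolar.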