Scientists have released what they say is the biggest ever artificial intelligence (AI) model for biology.
The model, which was trained on 128,000 genomes spanning the tree of life, from humans to single-celled bacteria and archaea, can write whole chromosomes and small genomes from scratch. It can also make sense of existing DNA, including hard-to-interpret ‘non-coding’ gene variants that are linked to disease. Evo-2 was co-developed by researchers at the Arc Institute and Stanford University, both in Palo Alto, California, and the chip maker NVIDIA. Scientists can use it through web interfaces, or download the freely available software code, data and other parameters needed to replicate the model. The developers see Evo-2 as a platform that others can adapt to their own uses.
In the past few years, researchers have developed increasingly powerful ‘protein language models’, such as the ESM-3 model developed by former Meta employees. After training on millions of protein sequences, these models have been used to help predict protein structures and to design entirely new proteins, including gene editors and fluorescent molecules. Unlike these models, Evo-2 was trained on genome data that contains both ‘coding sequences’, which carry instructions for making proteins, and non-coding DNA, which includes sequences that can control when, where and how genes are active.
The latest model is based on 128,000 genomes, including those of humans and other animals, plants and other eukaryotic organisms. These genomes encompass a total of 9.3 trillion DNA letters. Compared with prokaryotic genomes, eukaryotic genomes tend to be longer and more complex: genes are made up of interspersed coding and non-coding segments, and non-coding ‘regulatory DNA’ can lie far from the genes it controls. To handle this complexity, Evo-2 was built so that it can learn patterns between DNA sequences that are as far as 1 million base pairs apart.
One appeal of genome models such as Evo-2 is that they can generate new DNA sequences that correspond not just to proteins, but also to the non-coding sequences that work with them. Because it is trained on DNA from across the tree of life, Evo-2 could be adept at applying what it has learned from bacterial and archaeal genomes to designing new human proteins. The researchers hope to validate Evo-2 with laboratory experiments. For instance, they designed sequences that alter the accessibility of chromatin (folded-up DNA), a feature that influences the identity of cells in multicellular organisms, and are collaborating with another laboratory to test these designs in mouse embryonic stem cells. Protein language models and other AI tools for protein design have ushered in a bio-design revolution.
Source: Nature journal