[R] Genomic Large Language Models
r/MachineLearning
Can a DNA language model find what sequence alignment can't? I've been exploring Evo2, Arc Institute's genomic foundation model trained on 9.3 trillion nucleotides, to see whether its learned representations capture biological relationships beyond raw sequence similarity.

The setup: extract embeddings from Evo2's intermediate layers for 512 bp windows across 25 human genes, then compare the windows the model treats as similar against the matches BLAST (the standard sequence alignment tool) finds.

Most strong matches were driven by common repeat elements (especially Alu.
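For anyone who wants to try the comparison step, here is a minimal sketch of how it could look. It is not the code behind the post: the post doesn't say how similarity is scored, so cosine similarity, the 0.9 threshold, the window ID format, and the function names (`windows`, `cosine`, `embedding_only_pairs`) are all assumptions for illustration. Extracting embeddings from Evo2 is left out on purpose. The sketch takes one vector per window as given, plus a set of window pairs that BLAST already reported.

```python
import math
from itertools import combinations

WINDOW = 512  # window size in bp, as described in the post


def windows(seq, size=WINDOW, step=WINDOW):
    """Split a sequence into fixed-size windows; a trailing partial window is dropped."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def embedding_only_pairs(embeddings, blast_pairs, threshold=0.9):
    """Return window pairs that look similar in embedding space but have no BLAST hit.

    embeddings:  dict mapping a window ID to its embedding vector (assumed precomputed)
    blast_pairs: set of frozenset({id_a, id_b}) for pairs BLAST reported as hits
    threshold:   assumed similarity cutoff, not a value from the post
    """
    found = []
    for a, b in combinations(sorted(embeddings), 2):
        sim = cosine(embeddings[a], embeddings[b])
        if sim >= threshold and frozenset((a, b)) not in blast_pairs:
            found.append((a, b, sim))
    # Highest-similarity pairs first
    return sorted(found, key=lambda t: -t[2])


# Toy vectors standing in for real embeddings (hypothetical IDs and values)
toy = {
    "geneA:0": [1.0, 0.0, 0.2],
    "geneB:3": [0.9, 0.1, 0.2],
    "geneC:1": [0.0, 1.0, 0.0],
}
hits = {frozenset(("geneA:0", "geneC:1"))}  # pretend BLAST found only this pair
result = embedding_only_pairs(toy, hits)
```

With the toy data, only the `geneA:0` / `geneB:3` pair comes back: it is close in the toy embedding space and has no BLAST hit. In a real run you would probably also want to mask or flag repeat-derived windows before reading much into pairs like this, given how heavily repeats drove the strong matches.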