Training mRNA Language Models Across 25 Species for $165 (47 minute read)

TLDR AI

OpenMed built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. The team compared several transformer architectures for codon-level language modeling and found that CodonRoBERTa-large-v2 performed best, with a perplexity of 4.10 and a Spearman correlation of 0.40 with the Codon Adaptation Index (CAI), outperforming ModernBERT. They then scaled to 25 species, trained four production models in 55 GPU-hours, and built a species-conditioned system that, by the team's account, no other open-source project offers.
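
For readers unfamiliar with the two metrics quoted above, here is a minimal Python sketch of how perplexity and CAI are conventionally defined. It is not OpenMed's evaluation code: the function names, the two-amino-acid codon table, and the usage counts are all hypothetical, chosen only to keep the arithmetic easy to follow. Perplexity is the exponential of the mean negative log-likelihood per token, so a value near 4 means the model is roughly as uncertain as a uniform choice among four codons. CAI is the geometric mean of each codon's usage relative to the most frequent synonymous codon.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_adaptiveness(codon_counts, synonymous_groups):
    """Weight of each codon = its count / count of the most used synonym."""
    weights = {}
    for codons in synonymous_groups:
        top = max(codon_counts.get(c, 0) for c in codons)
        if top == 0:
            continue  # no usage data for this amino acid
        for c in codons:
            weights[c] = codon_counts.get(c, 0) / top
    return weights


def cai(sequence, weights):
    """Geometric mean of codon weights; codons with no weight are skipped."""
    usable = len(sequence) - len(sequence) % 3
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if weights.get(c, 0) > 0]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0


# Hypothetical toy data: two amino acids (Lys, Glu) with made-up usage counts.
TOY_GROUPS = [["AAA", "AAG"], ["GAA", "GAG"]]
TOY_COUNTS = {"AAA": 30, "AAG": 10, "GAA": 20, "GAG": 20}
TOY_WEIGHTS = relative_adaptiveness(TOY_COUNTS, TOY_GROUPS)

if __name__ == "__main__":
    # A model assigning probability 0.25 to every token has perplexity 4.
    print(perplexity([math.log(0.25)] * 5))
    # A sequence built from the most used codons scores the maximum CAI of 1.
    print(cai("AAAGAA", TOY_WEIGHTS))
    # Swapping in the rarer Lys codon lowers the score.
    print(cai("AAGGAG", TOY_WEIGHTS))
```

The reported Spearman figure would then be a rank correlation between a model-derived score per sequence and a CAI value like the one computed here. How the team derived the model-side score is not stated in this summary.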