Carbon: Decoding the Language of Life
r/LocalLLaMA
•
Generative AI
NLP
Open Source AI
AI Research
AI Tools
Hey, it's loubna from Hugging Face. Very happy to share our latest release: Carbon 🧬, a family of open DNA foundation models. Carbon-3B matches the current SOTA (Evo2-7B) while being 275x faster. We borrowed a lot from how modern LLMs are trained and from our SmolLM work, but DNA isn't language. Genomes are noisy, redundant, and shaped by evolution rather than communication. So we adjusted the recipe: Tokenizer. Most genomic models tokenize at the nucleotide level, which blows up sequence length. BPE is the obvious LLM-style fix, but it doesn't behave well on.