Pretraining with hierarchical memories: separating long-tail and common knowledge

arXiv:2510.02375v3 Announce Type: replace-cross The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a fraction is used per prompt, and impractical for edge devices with limited inference-time memory and compute. We address this shortcoming by a memory-augmented architecture and a pre