Tokens-per-Parameter Coverage Is Critical for Robust LLM Scaling Law Extrapolation

arXiv:2605.08541v1 Announce Type: new Neural scaling laws approximate a language model's loss as a power-law function of parameter count $N$ and token count $D$. Following Chinchilla-style compute-optimal