AI RESEARCH
Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning
arXiv cs.LG
arXiv:2605.09490v1 Announce Type: cross
Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response, permanently evicting low-importance tokens, is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a different question: must every token live in HBM, or can some live elsewhere? We