Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction

arXiv:2509.12464v2 Announce Type: replace Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces at inference time, which makes them costly to deploy at scale. We show that compression techniques such as neural network pruning cause greater performance loss on these models than in typical language modeling tasks, and in some cases can make the model slower, because the pruned model produces more thinking tokens while performing worse.