Playing Psychic: Using Thought Trees to Predict Reasoning Models Accuracy on Coding Tasks

arXiv:2604.16931v1 Announce Type: new Recent advances in large language models (LLMs) have shown that test-time scaling can substantially improve model performance on complex tasks, particularly in the coding domain. Under this paradigm, models use a larger token budget during inference to generate intermediate reasoning traces before producing a final answer. However, current evaluations primarily rely on competitive programming benchmarks, which may not capture the full range of reasoning abilities.