STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs

arXiv:2604.18177v1 Announce Type: new Benchmarks are often used as a standard for understanding LLM capabilities across domains. However, aggregate benchmark scores provide limited insight into the compositional skill gaps of LLMs and how to address them. To make these weaknesses visible, we propose the Scaffolded Task Design (STaD) framework. STaD generates controlled variations of benchmark tasks based on the concept of scaffolding, which