TIER: Trajectory-Invariant Execution Rewards for Multi-Step Tool Composition

arXiv:2605.16790v1 Announce Type: new

Tool use enables large language models to solve complex tasks through sequences of API calls, yet existing reinforcement learning approaches fail to scale to multi-step composition settings. Outcome-based rewards provide only sparse feedback, while trajectory-supervised rewards depend on annotated reference solutions, which penalizes valid alternative solutions and limits scalability.
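The contrast between the two existing reward styles can be made concrete with a toy sketch. This is not the TIER method; it only illustrates the two failure modes named above. The tool set (`add`, `mul`), the function names, and the step-matching rule are all assumptions made up for this illustration.

```python
from typing import Callable, Dict, List, Tuple

# A tool call is (tool_name, argument); a trajectory is a list of calls.
Call = Tuple[str, float]

# Hypothetical toy tools, for illustration only.
TOOLS: Dict[str, Callable[[float, float], float]] = {
    "add": lambda state, x: state + x,
    "mul": lambda state, x: state * x,
}


def execute(trajectory: List[Call], start: float = 0.0) -> float:
    """Run the calls in order, threading one numeric state through them."""
    state = start
    for name, arg in trajectory:
        state = TOOLS[name](state, arg)
    return state


def outcome_reward(trajectory: List[Call], target: float) -> float:
    """Sparse reward: 1 only if the final result is right, else 0."""
    return 1.0 if execute(trajectory) == target else 0.0


def trajectory_reward(trajectory: List[Call], reference: List[Call]) -> float:
    """Fraction of positions where the call equals the annotated reference."""
    matches = sum(1 for a, b in zip(trajectory, reference) if a == b)
    return matches / max(len(trajectory), len(reference))


reference = [("add", 2.0), ("mul", 3.0)]    # (0 + 2) * 3 = 6
alternative = [("add", 3.0), ("mul", 2.0)]  # (0 + 3) * 2 = 6, also valid
near_miss = [("add", 2.0), ("mul", 4.0)]    # (0 + 2) * 4 = 8, one wrong step

# The valid alternative solves the task but scores zero against the reference.
print(outcome_reward(alternative, 6.0), trajectory_reward(alternative, reference))
# The near miss gets no outcome signal, although one of its two steps is right.
print(outcome_reward(near_miss, 6.0), trajectory_reward(near_miss, reference))
```

In this toy setting the outcome reward cannot distinguish a near miss from a completely wrong trajectory, while the reference-matching reward assigns zero to a correct but differently ordered solution.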