I trained Qwen 3.5 2B to filter tool output for coding agents.

Agents can spend a lot of context on raw pytest, grep, git log, kubectl, pip install, file reads, stack traces, etc., even though usually only a small block is relevant. We've built a benchmark for task-conditioned tool-output pruning and fine-tuned Qwen 3.5 2B on it with Unsloth. The benchmark combines tool outputs from the SWE-bench dataset with synthetic examples.