SteerSeg: Attention Steering for Reasoning Video Segmentation

arXiv:2605.14908v1 Announce Type: new Video reasoning segmentation requires localizing objects across video frames from natural language expressions, often involving spatial reasoning and implicit references. Recent approaches leverage frozen large vision-language models (LVLMs) by extracting attention maps and using them as spatial priors for segmentation, enabling