From Global to Local: Rethinking CLIP Feature Aggregation for Person Re-Identification

arXiv:2604.22190v1 Announce Type: cross

CLIP-based person re-identification (ReID) methods aggregate spatial features into a single global \texttt{[CLS]} token that is optimized for image-text alignment rather than spatial selectivity, which makes the resulting representations fragile under occlusion and cross-camera variation.
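
To make the global-versus-local distinction concrete, the following is a minimal sketch and not the paper's method. It assumes a ViT-style encoder output laid out as one [CLS] vector followed by N patch tokens, and it assumes a per-patch visibility mask is available from somewhere; the names `split_tokens`, `masked_mean_pool`, and `visible`, as well as the toy numbers, are hypothetical and chosen only for illustration.

```python
def split_tokens(tokens):
    """Split an encoder output (list of 1 + N vectors) into [CLS] and patch tokens.

    Assumption: the first vector is the global [CLS] token and the
    remaining N vectors are spatial patch tokens.
    """
    return tokens[0], tokens[1:]


def masked_mean_pool(patches, visible):
    """Average only the patch tokens whose visibility flag is True."""
    kept = [p for p, v in zip(patches, visible) if v]
    if not kept:
        raise ValueError("at least one patch must be visible")
    dim = len(kept[0])
    return [sum(p[i] for p in kept) / len(kept) for i in range(dim)]


# Toy data (hypothetical): 4 patches with 2-dimensional features.
tokens = [
    [0.5, 0.5],  # [CLS]: one global vector for the whole image
    [1.0, 0.0],  # patch 0, visible
    [3.0, 2.0],  # patch 1, visible
    [9.0, 9.0],  # patch 2, occluded
    [7.0, 5.0],  # patch 3, occluded
]
cls_vec, patches = split_tokens(tokens)

# Hypothetical visibility mask: the last two patches are occluded.
visible = [True, True, False, False]
local_vec = masked_mean_pool(patches, visible)
```

In this sketch `cls_vec` is a single vector with no handle for excluding occluded regions, whereas the patch tokens can be pooled selectively. How local features are actually selected and aggregated in the paper is not stated in this abstract excerpt.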