RayRoPE: Projective Ray Positional Encoding for Multi-view Attention

arXiv:2601.15275v2 Announce Type: replace-cross We study positional encodings for multi-view transformers that process tokens from a set of posed input images, and seek a mechanism that encodes patches uniquely, allows $SE(3)$-invariant attention with multi-frequency similarity, and can adapt to the geometry of the underlying 3D scene. We find that prior (absolute or relative) encoding schemes for multi-view attention do not meet these desiderata, and present RayRoPE to address this gap.
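
To make "invariant attention with multi-frequency similarity" concrete, the sketch below shows the familiar one-dimensional analog: standard rotary positional encoding (RoPE), where attention scores depend only on the offset between two positions and each feature pair rotates at its own frequency. This is not RayRoPE and not taken from the paper; the function names `rope` and `dot`, the `base` value, and the toy vectors are assumptions chosen for illustration. Invariance to a shared shift of 1D positions here plays the role that invariance to a shared $SE(3)$ transform of all cameras plays in the multi-view setting.

```python
import cmath
import math


def rope(vec, pos, base=10000.0):
    """Apply standard 1D RoPE: rotate each consecutive pair of `vec`
    by the angle pos * freq_k, with one frequency per pair."""
    assert len(vec) % 2 == 0, "feature dimension must be even"
    n_pairs = len(vec) // 2
    out = []
    for k in range(n_pairs):
        freq = base ** (-k / n_pairs)  # geometric frequency ladder (illustrative choice)
        z = complex(vec[2 * k], vec[2 * k + 1]) * cmath.exp(1j * pos * freq)
        out.extend([z.real, z.imag])
    return out


def dot(a, b):
    """Plain dot product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))


# Toy query and key features (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset (3.0) between query and key, but shifted by a common amount.
score_a = dot(rope(q, 5.0), rope(k, 2.0))
score_b = dot(rope(q, 105.0), rope(k, 102.0))

# The two scores agree up to floating-point error.
print(math.isclose(score_a, score_b, rel_tol=1e-9, abs_tol=1e-9))
```

The score is unchanged because rotating both vectors by a common angle cancels in the dot product of each pair, leaving only the relative rotation.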