Parabolic Position Encoding: Vision-Centric, Principled, Extrapolatable, General

ArXi:2602.01418v2 Announce Type: replace-cross We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens-such as from videos, event camera streams, images, or point clouds-our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D-sequences in language to nD-structures in vision, but only with partial account of vision characteristics.