STRNet: Visual Navigation with Spatio-Temporal Representation through Dynamic Graph Aggregation

arXiv:2604.02829v1

Visual navigation requires a robot to reach a specified goal, such as an image, based on a sequence of first-person visual observations. While recent learning-based approaches have made significant progress, they often focus on improving policy heads or decision strategies while relying on simplistic feature encoders and temporal pooling to represent visual input. This reliance discards fine-grained spatial and temporal structure, which ultimately limits the accuracy of action prediction and progress estimation.