GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction

ArXi:2604.19624v1 Announce Type: new Reconstructing physically plausible 3D human-scene interactions (HSI) from a single image currently presents a trade-off: optimization based methods offer accurate contact but are slow (~20s), while feed-forward approaches are fast yet lack explicit interaction reasoning, producing floating and interpenetration artifacts. Our key insight is that geometry-based human--scene fitting can be amortized into fast feed-forward inference.