SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

arXiv:2603.16859v1 Announce Type: new Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues.