Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

arXiv:2605.09996v1 Announce Type: new While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language. Unified omnimodal benchmarking that jointly covers text, image, and audio is still limited and lacks the methodological rigor to account for absent-persona scenarios or systematic grounding studies. We