Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

arXiv:2510.12720v2 Announce Type: replace Fine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains underexplored.