Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

arXiv:2604.11244v1

Advances in Multimodal Large Language Models (MLLMs) are transforming video captioning from a descriptive endpoint into a semantic interface for both video understanding and generation. However, the dominant paradigm still casts videos as monolithic narrative paragraphs that entangle visual, auditory, and identity information. This dense coupling not only compromises representational fidelity but also limits scalability, since even local edits can trigger global rewrites.