Video-EM: Event-Centric Episodic Memory for Long-Form Video Understanding

arXiv:2508.09486v2 Announce Type: replace Video Large Language Models (Video-LLMs) have shown strong video understanding capabilities, yet their application to long-form videos remains constrained by limited context windows. A common workaround is to compress a long video into a handful of representative frames via retrieval or summarization. However, most existing pipelines score each frame in isolation, implicitly assuming that frame-level saliency is sufficient for downstream reasoning.
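
To make the frame-in-isolation baseline concrete, here is a minimal sketch of query-based top-k frame retrieval. It is not the method of any particular pipeline: the function names (`cosine`, `top_k_frames`), the use of cosine similarity, and the toy 2-D embeddings are all assumptions made for illustration. Each frame is scored against the query on its own, with no reference to neighboring frames.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_frames(frame_embeddings, query_embedding, k):
    """Hypothetical baseline: score every frame independently, keep the k best.

    Returns the selected frame indices in temporal order.
    """
    scores = [cosine(f, query_embedding) for f in frame_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy embeddings (assumed for illustration): frames 0, 1 and 3 look alike,
# frame 2 shows something different.
frames = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.8, 0.2],
]
query = [1.0, 0.0]

selected = top_k_frames(frames, query, k=2)
print(selected)
```

In this toy case the two selected frames are adjacent and nearly identical, because nothing in an independent per-frame score accounts for what the other selected frames already cover or how frames relate in time.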