nvidia/NVILA-8B-HD-Video · Hugging Face

NVILA-HD-Video is a multimodal large language model with 8B parameters that understands and answers questions about videos with up to 4K resolution and up to 1K frames. Specifically, NVILA-HD-Video uses AutoGaze to remove redundant patches from a video before running the ViT or LLM. Empirically, AutoGaze can reduce the number of patches in a video by up to 100x, reducing the latency of the ViT and LLM by up to 19x and 10x, respectively.
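
To get a feel for what a 100x patch reduction means at this scale, here is a rough back-of-the-envelope sketch. It is not part of the model's code: the 3840x2160 frame size, the 16-pixel patch size, and the reading of "1K frames" as 1000 are illustrative assumptions, and `patch_count` is a hypothetical helper. Only the 100x figure comes from the description above, and it is an upper bound.

```python
# Rough patch-budget arithmetic. Every constant except REDUCTION is an
# assumption made for illustration, not a documented model setting.

def patch_count(width: int, height: int, frames: int, patch_size: int) -> int:
    """Count non-overlapping square patches across all frames of a video."""
    return (width // patch_size) * (height // patch_size) * frames

FRAME_W, FRAME_H = 3840, 2160  # assumed 4K UHD frame size
NUM_FRAMES = 1000              # assumed reading of "1K frames"
PATCH_SIZE = 16                # assumed ViT patch size (hypothetical)
REDUCTION = 100                # "up to 100x" from the description (best case)

full = patch_count(FRAME_W, FRAME_H, NUM_FRAMES, PATCH_SIZE)
kept = full // REDUCTION

print(f"patches before reduction: {full:,}")
print(f"patches after 100x reduction: {kept:,}")
```

Under these assumptions, the ViT would see hundreds of thousands of patches instead of tens of millions, which is why pruning before the ViT and LLM can have such a large effect on latency.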