AI RESEARCH
INST-IT: Boosting Instance Understanding via Explicit Visual Prompt Instruction Tuning
arXiv CS.CV
arXiv:2412.03565v3 Announce Type: replace Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which requires more fine-grained comprehension and alignment. Instance-level understanding is crucial for LMMs because it focuses on the specific elements that we are most interested in.