Your Multimodal RAG Pipeline Should Look at Images Twice

Once at ingestion to make them findable. Again at retrieval to actually answer the question. Here’s the standard advice for building a multimodal RAG system: take your PDF, extract images, run them through a vision model, the summaries as text chunks, and move on. Every image gets one shot at being understood, and that understanding happens before a single user question has been asked. This works fine until it doesn’t. And in production, it stops working exactly when it matters most. I spent days building a multimodal RAG system for large financial document analysis.