I used Gemini 2.5 Flash to parse receipts at scale. Here's what I learned about multimodal OCR in production

r/artificial

For my startup, I needed to extract structured data (item name, price, quantity, unit cost) from photos of receipts and from product images on the shelf: faded thermal paper, crumpled, bad lighting, the works.

Key findings after thousands of test receipts:

Single-pass extraction beats two-step pipelines. Most setups use a vision model for OCR and then a language model for structuring. Gemini does both in one call, which was faster and cheaper.

Prompt structure matters more than model size. Asking for JSON with strict field definitions dramatically outperformed open-ended extraction prompts.
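To make the "strict field definitions" point concrete, here is a rough sketch of what such a prompt and a matching response check could look like. This is not my production code: the field descriptions, the names `FIELDS`, `build_prompt`, `LineItem`, and `parse_response` are all made up for illustration, and the actual model call is left out on purpose. Only the four field names come from the setup described above.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical field definitions. The four names come from the post;
# the types and descriptions are assumptions for this sketch.
FIELDS = {
    "item_name": "string, the product name as printed",
    "price": "number, the line total",
    "quantity": "number, units purchased",
    "unit_cost": "number, price per unit",
}
NUMERIC_FIELDS = {"price", "quantity", "unit_cost"}


def build_prompt() -> str:
    """Build a single-pass extraction prompt with strict field definitions."""
    field_lines = [f'- "{name}": {desc}' for name, desc in FIELDS.items()]
    return (
        "Extract every line item from the attached receipt image.\n"
        "Return only a JSON array of objects with exactly these fields:\n"
        + "\n".join(field_lines)
        + "\nUse null for any value you cannot read. Do not add other keys."
    )


@dataclass
class LineItem:
    item_name: Optional[str]
    price: Optional[float]
    quantity: Optional[float]
    unit_cost: Optional[float]


def parse_response(text: str) -> List[LineItem]:
    """Parse the model's JSON text and reject rows that break the schema."""
    rows = json.loads(text)
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array")
    items = []
    for row in rows:
        # Every row must be an object with exactly the requested keys.
        if not isinstance(row, dict) or set(row) != set(FIELDS):
            raise ValueError(f"unexpected fields: {row!r}")
        # Numeric fields must be numbers or null (bools are rejected).
        for key in NUMERIC_FIELDS:
            value = row[key]
            if value is not None and (
                isinstance(value, bool) or not isinstance(value, (int, float))
            ):
                raise ValueError(f"{key} must be a number or null: {value!r}")
        items.append(LineItem(**row))
    return items
```

You would send `build_prompt()` together with the image in one request, then run the returned text through `parse_response`. Rejecting anything that does not match the schema exactly is a design choice in this sketch: it keeps malformed rows out of downstream data, at the cost of needing a retry path.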