Rethinking Patient Education as Multi-turn Multi-modal Interaction

arXiv:2604.14656v1 Announce Type: new

Most medical multimodal benchmarks focus on static tasks such as image question answering, report generation, and plain-language rewriting. Patient education is more demanding: systems must identify relevant evidence across images, show patients where to look, explain findings in accessible language, and handle confusion or distress. Yet most patient education work remains text-only, even though combined image-and-text explanations may better support understanding. We.