UM-Text: A Unified Multimodal Model for Image Understanding and Visual Text Editing

arXiv:2601.08321v3 Announce Type: replace With the rapid advancement of image generation, visual text editing using natural language instructions has received increasing attention. The main challenge of this task is to fully understand both the instruction and the reference image, and thus to generate visual text that is style-consistent with the image. Previous methods often involve complex steps for specifying the text content and its attributes, such as font size, color, and layout, without considering stylistic consistency with the reference image.