UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment

arXiv:2602.19442v3 Announce Type: replace Vision-language models (VLMs) can describe urban scenes in rich detail, yet consistently fail to produce reliable human preference labels in domain-specific tasks such as safety assessment and aesthetic evaluation. The standard fix, fine-tuning or RLHF, requires large-scale annotations and model re