Rethinking Multimodal Fusion for Time Series: Auxiliary Modalities Need Constrained Fusion

arXiv:2603.22372v1 Announce Type: cross Recent advances in multimodal learning have motivated the integration of auxiliary modalities such as text or vision into time series (TS) forecasting. However, most existing methods provide limited gains, often improving performance only on specific datasets or relying on architecture-specific designs that limit generalization.