JSSFF: A Joint Structural-Semantic Fusion Framework for Remote Sensing Image Captioning

ArXi:2604.24031v1 Announce Type: new The encoder-decoder framework has become widely popular nowadays. In this model, the encoder extracts informative visual features from an input image, and the decoder employs a sequence-to-sequence formulation to generate the corresponding textual description from these features. The existing models focus on the decision part. However, extracting meaningful information from the image can help the decoder generate an accurate caption by providing information about the objects and their relationship. Remote sensing images are highly complex.