International Journal of Scientific Engineering and Research (IJSER)
Call for Papers | Fully Refereed | Open Access | Double Blind Peer Reviewed | ISSN: 2347-3878

China | Computer and Mathematical Sciences | Volume 13 Issue 6, June 2025 | Pages: 81 - 87


Multimodal Semantic Interaction for Text Image Super-Resolution

Xin Jiang

Abstract: Existing text image super-resolution methods suffer from feature representations that lack scale adaptability and from insufficient resolution, which makes it difficult for the recognizer to extract correct textual information to guide the reconstruction network. To address these issues, we propose a text image super-resolution reconstruction method based on multimodal semantic interaction. Using the attention mask in the semantic reasoning module, we correct the textual content information, obtain semantic prior knowledge, and constrain and guide the network to reconstruct semantically accurate super-resolution text images. To enhance the network's representational capability and adapt to text images of different shapes and lengths, we design a multimodal semantic interaction block whose basic components are a visual dual-stream integration block, a cross-modal adaptive fusion block, and an orthogonal bidirectional gated recurrent unit. Experimental results on the TextZoom test set show that the proposed method outperforms other mainstream methods on the PSNR and SSIM quantitative metrics, with average recognition accuracy improvements of 2.9%, 3.6%, and 3.7% on three recognizers (ASTER, MORAN, and CRNN) compared with the TPGSR model. These results demonstrate that the proposed multimodal semantic interaction-based method can effectively improve text recognition accuracy.

Keywords: super-resolution reconstruction; text image; feature semantic prior; multimodal
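As context for the quantitative metrics cited in the abstract: PSNR is derived from the mean squared error (MSE) between the reconstructed image and the ground-truth high-resolution image. The following is a minimal illustrative sketch of that relationship, not the paper's evaluation code; the function names and the flat pixel-list representation are my own simplifications.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length sequences of pixel values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def psnr(mse_value, max_val=255.0):
    """Peak signal-to-noise ratio in dB for a given MSE and peak pixel value."""
    if mse_value == 0:
        return float("inf")  # identical images: PSNR is unbounded
    return 10.0 * math.log10((max_val ** 2) / mse_value)

# Example: an MSE of 650.25 on 8-bit images corresponds to exactly 20 dB,
# since 255^2 / 650.25 = 100 and 10 * log10(100) = 20.
print(psnr(650.25))
```

Higher PSNR indicates a reconstruction closer to the ground truth, which is why it is reported alongside SSIM (a perceptual structure-based metric) when comparing super-resolution methods.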
