China | Computer Science and Engineering | Volume 14, Issue 3, March 2026 | Pages: 21-26
STQNet-VL: Vision-Language Guided Multi-Modal Learning for Multi-View Soccer Foul Recognition
Abstract: This paper addresses multi-view soccer foul recognition, which aims to identify foul actions and classify their severity from broadcast videos. Existing methods rely solely on visual information and often produce ambiguous predictions under occlusion, motion blur, and controversial contact scenarios. To overcome these limitations, we propose STQNet-VL, a vision-language guided multi-modal framework built upon the spatial-temporal query network (STQNet). The model first employs a large vision-language model to generate fine-grained textual descriptions of player interactions and contact regions. These descriptions are encoded with a pre-trained CLIP text encoder to keep them aligned with the visual feature space. A transformer-based multi-modal fusion module then enables deep cross-attention interaction between the visual and textual representations. Experimental results on the SoccerNet-MVFouls dataset show that STQNet-VL achieves 53% balanced accuracy on action classification (BA_act), 45% on severity classification (BA_sev), and 49% overall balanced accuracy, outperforming prior state-of-the-art methods and improving overall balanced accuracy by 2% over STQNet-Large. These results validate the effectiveness of integrating language priors for robust multi-view sports action understanding.
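To make the fusion step concrete, below is a minimal sketch of a transformer-style cross-attention fusion block in PyTorch. It is an illustration under stated assumptions, not the paper's actual implementation: the class name CrossModalFusion, the 512-dimensional embedding size (chosen to match the text space of CLIP ViT-B/32), the head count, and the random stand-in tensors for STQNet visual queries and CLIP text embeddings are all hypothetical, since the abstract does not specify the module's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of a transformer-based visual-textual fusion block.

    Visual tokens act as queries and attend to CLIP text embeddings
    (keys/values) via cross-attention, followed by a residual
    feed-forward layer. All names and dimensions are illustrative.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Visual tokens (queries) attend to text embeddings (keys/values).
        attn_out, _ = self.cross_attn(query=visual, key=text, value=text)
        x = self.norm1(visual + attn_out)  # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))    # feed-forward + residual + norm
        return x

# Toy usage: 2 clips, 16 spatio-temporal visual tokens, 4 text tokens,
# all projected into the same 512-d embedding space.
visual_feats = torch.randn(2, 16, 512)  # stand-in for STQNet visual queries
text_feats = torch.randn(2, 4, 512)     # stand-in for CLIP text embeddings
fused = CrossModalFusion()(visual_feats, text_feats)
print(fused.shape)  # torch.Size([2, 16, 512])
```

Keeping the visual tokens as queries preserves their spatio-temporal structure while letting each token draw in language-side evidence about contacts and interactions, which is one plausible reading of the cross-attention design the abstract describes.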
Keywords: multi-view action recognition; vision-language models; transformer-based fusion; cross-modal learning; sports video analysis; SoccerNet-MVFouls; spatial-temporal modeling; class imbalance learning