Summary: Interactive tasks involving multi-modal data place demanding requirements on the comprehensive use of knowledge from different modalities, which has driven the emergence of multi-modal knowledge graphs. When constructing such graphs, accurately determining whether an image entity and a text entity refer to the same object is central to aligning Chinese cross-modal entities. To address this problem, a Chinese cross-modal entity alignment method based on a multi-modal knowledge graph is proposed. Image information is introduced into the entity alignment task, and a pre-trained model combining single-stream and dual-stream interaction, named CCMEA, is designed for fine-grained, domain-specific images and Chinese text. Under a self-supervised learning scheme, textual and visual features are extracted by a text-visual encoder, and fine-grained modeling is performed by a cross-encoder. Finally, contrastive learning is employed to evaluate the degree of alignment between image and text entities. The experimental results show that the Mean Recall (MR) of the CCMEA model improved by 3.20 and 11.96 percentage points over the Wukong<sub>ViT-B</sub> baseline model on the MUGE and Flickr30k-CN datasets, respectively. Furthermore, the model achieved an MR of 94.3% on the self-built TEXTILE dataset. These results demonstrate that the proposed method can align Chinese cross-modal entities effectively and with high accuracy in practical applications.
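The summary does not spell out the contrastive objective used to score alignment between image and text entities. Below is a minimal sketch, assuming a CLIP-style symmetric InfoNCE loss over paired embeddings; the function name, embedding dimension, and temperature value are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired text/image
    embeddings; matched pairs lie on the diagonal of the similarity
    matrix. (Hypothetical sketch; not the paper's exact objective.)"""
    # L2-normalize so dot products become cosine similarities.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # Pairwise text-to-image similarity matrix, scaled by temperature.
    logits = text_emb @ image_emb.t() / temperature

    # Each text matches the image at the same batch index.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)      # text -> image
    loss_i2t = F.cross_entropy(logits.t(), targets)  # image -> text
    return (loss_t2i + loss_i2t) / 2

# Usage with dummy outputs standing in for the text-visual encoders
# (batch of 8 entities, assumed 512-dim embeddings):
text_emb = torch.randn(8, 512)
image_emb = torch.randn(8, 512)
print(contrastive_alignment_loss(text_emb, image_emb).item())
```

At inference time, the same similarity matrix can be used directly: an image-text pair whose cosine similarity ranks highest among candidates is treated as the aligned cross-modal entity pair, which is what metrics such as Mean Recall measure.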