Retrieval of Scientific Documents Based on HFS and BERT

Bibliographic Details
Main Authors: Xuedong Tian, Jiameng Wang
Format: Article
Language: English
Published: IEEE 2021-01-01
Series: IEEE Access
Subjects: HFS
Online Access: https://ieeexplore.ieee.org/document/9314107/
Description
Summary: Retrieving scientific documents whose main content is mathematical expressions requires considering both the expressions and the features of their surrounding text. Mathematical expressions, however, differ from ordinary text in grammar and semantics, so integrating the two kinds of features in a single retrieval method is difficult. This study proposes a retrieval method for scientific documents based on HFS (Hesitant Fuzzy Sets) and BERT (Bidirectional Encoder Representations from Transformers). The method exploits the strength of HFS in multi-attribute decision making and of BERT in context-dependent similarity calculation. Mathematical expressions are analyzed and the membership degrees of the symbols' multiple attributes are calculated to obtain the similarity between expressions, which improves the accuracy of mathematical expression recall. The text surrounding each expression is then extracted, and BERT is used to calculate context similarity. Finally, the recalled scientific documents are sorted by context similarity to produce the final retrieval result. Experiments were carried out on 10,372 Chinese and 11,770 English scientific documents in the NTCIR extended data set. The average MAP_k (k = 10) for the recall results was 74.13%, and the average nDCG (n = 10) for the ranking of scientific documents was 86.04%.
ISSN: 2169-3536
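
The two-stage flow in the summary (recall by expression similarity, then re-ranking by context similarity) and the two reported metrics can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the function names, the dictionary layout, the hesitant fuzzy similarity (one minus a mean absolute difference over sorted, equal-length membership lists), the plain cosine over precomputed vectors standing in for BERT embeddings, and the particular MAP and nDCG variants, which the record does not define.

```python
import math


def hesitant_similarity(h_query, h_doc):
    """Similarity of two expressions described by hesitant fuzzy elements.

    Each argument maps an attribute name to a list of membership degrees
    in [0, 1]. Assumption: both lists for an attribute have equal length.
    """
    total = 0.0
    for attr, q_vals in h_query.items():
        q_sorted, d_sorted = sorted(q_vals), sorted(h_doc[attr])
        diff = sum(abs(q - d) for q, d in zip(q_sorted, d_sorted)) / len(q_sorted)
        total += 1.0 - diff
    return total / len(h_query)


def cosine(u, v):
    """Cosine similarity of two context vectors (stand-ins for BERT embeddings)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query, docs, recall_size=10):
    """Recall by expression similarity, then sort the recalled set by context."""
    recalled = sorted(
        docs,
        key=lambda d: hesitant_similarity(query["expr"], d["expr"]),
        reverse=True,
    )[:recall_size]
    return sorted(recalled, key=lambda d: cosine(query["ctx"], d["ctx"]), reverse=True)


def average_precision_at_k(relevant, ranked_ids, k=10):
    """AP@k, normalized by the number of relevant hits found in the top k."""
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0


def ndcg_at_n(gains, n=10):
    """nDCG@n for a ranked list of graded relevance gains."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:n]))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0


# Hypothetical toy data: three documents and one query.
docs = [
    {"id": "a", "expr": {"symbol": [0.9, 0.8], "level": [0.7]}, "ctx": [1.0, 0.0]},
    {"id": "b", "expr": {"symbol": [0.9, 0.7], "level": [0.7]}, "ctx": [0.0, 1.0]},
    {"id": "c", "expr": {"symbol": [0.1, 0.2], "level": [0.1]}, "ctx": [0.0, 1.0]},
]
query = {"expr": {"symbol": [0.9, 0.8], "level": [0.7]}, "ctx": [0.0, 1.0]}
ranked_ids = [d["id"] for d in retrieve(query, docs, recall_size=2)]
```

In this toy run, document "c" is dropped at the recall stage because its expression attributes differ most from the query, and the two recalled documents are then ordered by context similarity alone. MAP over a query set would be the mean of `average_precision_at_k` across queries.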