String Comparators for Chinese-Characters-Based Record Linkages

In the context of big data, data sharing between institutions not only greatly reduces the cost of information collection but also helps obtain analysis results effectively and efficiently. Record linkage is the task of locating records that refer to the same entity from hetero...

Bibliographic Details
Main Authors:	Senlin Xu, Mingfan Zheng, Xinran Li
Format:	Article
Language:	English
Published:	IEEE 2021-01-01
Series:	IEEE Access
Subjects:	Record linkage; Chinese characters; soundshape code; string comparator; Fellegi-Sunter model
Online Access:	https://ieeexplore.ieee.org/document/9310262/

id	doaj-d1df433526d44ae18d7b232493147b5b
record_format	Article
spelling	doaj-d1df433526d44ae18d7b232493147b5b; 2021-03-30T15:01:11Z; eng; IEEE; IEEE Access; 2169-3536; 2021-01-01; 9; 3735; 3743; 10.1109/ACCESS.2020.3047927; 9310262; String Comparators for Chinese-Characters-Based Record Linkages; Senlin Xu (0); Mingfan Zheng (1), https://orcid.org/0000-0003-1711-7443; Xinran Li (2), https://orcid.org/0000-0002-5678-6829; Department of Mathematics and Statistics, College of Science, Huazhong Agricultural University, Wuhan, China; Department of Mathematics and Statistics, College of Science, Huazhong Agricultural University, Wuhan, China; Department of Mathematics and Statistics, College of Science, Huazhong Agricultural University, Wuhan, China; In the context of big data, data sharing between institutions not only greatly reduces the cost of information collection but also helps obtain analysis results effectively and efficiently. Record linkage is the task of locating records that refer to the same entity from heterogeneous data sources. In recent decades, extensive research on alphabet-based record linkage has been carried out, among which the Fellegi-Sunter model extended by Winkler has outperformed others. However, performing record linkage on Chinese-character-based datasets remains a challenge. In this article, two set-based methods (Cosine similarity and Dice similarity) were first introduced, and then the similarity of Chinese characters was quantified using an adapted encoding technique that exploits both the shape and the pronunciation of Chinese characters. A new method called Hybrid similarity was then proposed, which combines the character transformation technique (SoundShape Code) with Dice similarity. Finally, we applied these methods to simulated datasets and evaluated each by the number of misclassified record pairs and the computational time. The results demonstrated that our Hybrid similarity method outperformed the others in reducing the number of misclassified pairs at a relatively low computational cost.; https://ieeexplore.ieee.org/document/9310262/; Record linkage; Chinese characters; soundshape code; string comparator; Fellegi-Sunter model
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Senlin Xu, Mingfan Zheng, Xinran Li
spellingShingle	Senlin Xu, Mingfan Zheng, Xinran Li; String Comparators for Chinese-Characters-Based Record Linkages; IEEE Access; Record linkage; Chinese characters; soundshape code; string comparator; Fellegi-Sunter model
author_facet	Senlin Xu, Mingfan Zheng, Xinran Li
author_sort	Senlin Xu
title	String Comparators for Chinese-Characters-Based Record Linkages
title_sort	string comparators for chinese-characters-based record linkages
publisher	IEEE
series	IEEE Access
issn	2169-3536
publishDate	2021-01-01
description	In the context of big data, data sharing between institutions not only greatly reduces the cost of information collection but also helps obtain analysis results effectively and efficiently. Record linkage is the task of locating records that refer to the same entity from heterogeneous data sources. In recent decades, extensive research on alphabet-based record linkage has been carried out, among which the Fellegi-Sunter model extended by Winkler has outperformed others. However, performing record linkage on Chinese-character-based datasets remains a challenge. In this article, two set-based methods (Cosine similarity and Dice similarity) were first introduced, and then the similarity of Chinese characters was quantified using an adapted encoding technique that exploits both the shape and the pronunciation of Chinese characters. A new method called Hybrid similarity was then proposed, which combines the character transformation technique (SoundShape Code) with Dice similarity. Finally, we applied these methods to simulated datasets and evaluated each by the number of misclassified record pairs and the computational time. The results demonstrated that our Hybrid similarity method outperformed the others in reducing the number of misclassified pairs at a relatively low computational cost.
topic	Record linkage; Chinese characters; soundshape code; string comparator; Fellegi-Sunter model
url	https://ieeexplore.ieee.org/document/9310262/
work_keys_str_mv	AT senlinxu stringcomparatorsforchinesecharactersbasedrecordlinkages AT mingfanzheng stringcomparatorsforchinesecharactersbasedrecordlinkages AT xinranli stringcomparatorsforchinesecharactersbasedrecordlinkages
_version_	1724180261090885632
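
The description above names two set-based comparators, Cosine similarity and Dice similarity. As a rough illustration only, the sketch below applies the standard set-based definitions of both, under the assumption that each string is reduced to the set of its characters. The function names are hypothetical and not taken from the article, whose exact tokenization may differ; the SoundShape Code step of the Hybrid method is not reproduced.

```python
import math


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient over character sets: 2|A & B| / (|A| + |B|)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def cosine_similarity(a: str, b: str) -> float:
    """Set-based cosine similarity: |A & B| / sqrt(|A| * |B|)."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        # Assumption: an empty string matches only another empty string.
        return 1.0 if set_a == set_b else 0.0
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))


if __name__ == "__main__":
    # Two names that share one of their two characters.
    print(dice_similarity("张伟", "张玮"))
    print(cosine_similarity("张伟", "张玮"))
```

Both functions score exact character overlap only, so two characters that look or sound alike contribute nothing unless they are identical; that gap is what the description's shape-and-pronunciation encoding is meant to address.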

Similar Items