Term Translation Extraction Using Web Mining Techniques

Bibliographic Details
Main Authors: Wen-Hsiang Lu, 盧文祥
Other Authors: Hsi-Jian Lee
Format: Others
Language: en_US
Published: 2003
Online Access: http://ndltd.ncl.edu.tw/handle/61958058418911296232
id ndltd-TW-092NCTU0392003
record_format oai_dc
collection NDLTD
language en_US
format Others
sources NDLTD
description Doctoral dissertation === National Chiao Tung University === Department of Computer Science and Information Engineering === 92 === The Web is becoming the largest data repository in the world. Web mining, an emerging research area, studies how to discover knowledge in the Web's diverse data resources for the benefit of Web-based information systems. Multilingual terminological resources, such as multilingual lexicons and thesauri, are valuable for academic research and for practical applications such as machine translation (MT), cross-language information retrieval (CLIR), and information exchange in electronic commerce (EC). However, manual lexicography is time-consuming and not cost-effective. It is therefore worthwhile to construct multilingual translation lexicons automatically by mining Web content, which consists of huge amounts of multilingual, wide-scoped hypertext resources. Conventional methods for automatically constructing translation lexicons have mostly relied on bilingual parallel text corpora, but adequate parallel corpora remain unavailable for many subject domains and languages. Meanwhile, most Web search users prefer to issue short queries, which often contain unknown terms (terms not included in general-purpose dictionaries). CLIR may fail if any term in a short query cannot be translated correctly. Unlike previous work, this dissertation focuses on exploiting large amounts of Web data as live corpora for the analysis of term translation, and proposes Web mining approaches to automatic multilingual term translation, especially for diverse, short, unknown Web query terms.
The main results are as follows:
• Mining of Web resources: We show that multilingual translations of diverse unknown terms can be extracted automatically from two kinds of multilingual Web resources: Web anchor texts and search-result pages. Such resources contain rich mixed-language texts in which many terms co-occur frequently with their translations, and they are easy to collect from the fast-growing Web. To some extent, this work opens a new research issue for further investigation in Web mining.
• Innovative approaches: We first propose a probabilistic inference model that uses Web anchor texts and link structures for translation extraction. To extend this model to further multilingual translations, we propose a more generalized transitive translation model. To increase translation coverage for diverse Web query terms, we employ the chi-square test and context-vector analysis to exploit search-result pages. We also develop an integrated Web mining approach that uses both kinds of Web resources to improve the precision and coverage of translation.
• Promising results in large-scale tests: To evaluate the translation performance of the integrated Web mining approach, we collected several kinds of test term sets, including two large query logs from real-world search engines, containing 228,566 and 114,182 query terms, as well as proper names and technical terms from the Yahoo! Web Site Directory. The approach achieved about 67% and 44% accuracy on 622 randomly selected frequent Web query terms and 50 technical terms, respectively. Importantly, most of the test terms are unknown (about 64% to 82%). The approach is also easy to combine with the probabilistic retrieval model to improve CLIR performance. Experiments on the NTCIR-2 English-Chinese cross-language retrieval task showed that the approach has the potential to handle the translation of short queries. The mean average precision (MAP) of CLIR increased from 0.207 (dictionary-based approach) to 0.241, a relative gain of about 16%.
• Developing a practical cross-language Web search system: Most existing CLIR systems rely on built-in translation dictionaries and cannot translate many unknown Web queries. As a result, they have not lived up to expectations as practical cross-language Web search (CLWS) services. To meet users' requirements for CLWS services, we have developed a CLWS system called LiveTrans, based on our integrated Web mining approach. Currently, the system automatically generates translation suggestions for diverse unknown Web queries and provides CLWS services for retrieving both Web pages and images across language pairs among English, Chinese, Japanese, and Korean.
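The abstract mentions using the chi-square test on search-result pages to find translations of query terms. As a rough illustration only, and not the dissertation's actual formulation, the sketch below scores how strongly a source term and each candidate translation co-occur, using the standard chi-square statistic for a 2x2 contingency table of page counts. The function names, the table layout, and all counts are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table of page counts.

    a: pages containing both the source term and the candidate
    b: pages containing the source term but not the candidate
    c: pages containing the candidate but not the source term
    d: pages containing neither
    """
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        # A row or column total is zero, so no association can be measured.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def rank_candidates(counts):
    """Order candidate translations by descending chi-square score.

    counts maps each candidate string to its (a, b, c, d) tuple.
    """
    scores = {cand: chi_square(*cells) for cand, cells in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical page counts for the source term "Yahoo".
    example = {
        "雅虎": (80, 20, 10, 890),   # co-occurs with the source term often
        "搜尋": (30, 70, 200, 700),  # common word, weak association
    }
    print(rank_candidates(example))
```

Note that the statistic measures the strength of association without regard to its direction, so a practical system would presumably also check that the two terms co-occur more often than chance would predict.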
author2 Hsi-Jian Lee
author_facet Hsi-Jian Lee
Wen-Hsiang Lu
盧文祥
author Wen-Hsiang Lu
盧文祥
title Term Translation Extraction Using Web Mining Techniques
publishDate 2003
url http://ndltd.ncl.edu.tw/handle/61958058418911296232
spelling ndltd-TW-092NCTU0392003 2016-06-17T04:16:03Z http://ndltd.ncl.edu.tw/handle/61958058418911296232 Term Translation Extraction Using Web Mining Techniques 以網路探勘為基礎之術語翻譯擷取技術 Wen-Hsiang Lu 盧文祥 Doctoral dissertation National Chiao Tung University Department of Computer Science and Information Engineering 92 Hsi-Jian Lee Lee-Feng Chien 李錫堅 簡立峰 2003 thesis 96 en_US