A study on the improved techniques of corpus-based frequency approaches in Automatic Term Extraction (ATE)(The case study: basic medicine vocabulary)

Corpus-based studies in linguistics, known as corpus linguistics, have grown dramatically in recent years. The current research studies improved frequency techniques for the Farsi language and was conducted to arrive at a scientific approach to automatic term...

Bibliographic Details
Main Authors:	Zohreh Zolfaghar, Tayebeh Mosavi Miangah, Belghis Rovshan, Amir Reza Vakilifard
Format:	Article
Language:	fas
Published:	Iranian Research Institute for Information and Technology 2020-07-01
Series:	Iranian Journal of Information Processing & Management
Subjects:	automatic term extraction, medicine vocabulary, corpus, hybrid extraction methods, Farsi language teaching, information retrieval
Online Access:	http://jipm.irandoc.ac.ir/article-1-4057-en.html

id	doaj-e2baac66923b40f4b8916c50e6515ebd
record_format	Article
spelling	doaj-e2baac66923b40f4b8916c50e6515ebd; 2020-11-25T03:33:33Z; fas; Iranian Research Institute for Information and Technology; Iranian Journal of Information Processing & Management; 2251-8223; 2251-8231; 2020-07-01; vol. 35, no. 4, pp. 1039-1064; A study on the improved techniques of corpus-based frequency approaches in Automatic Term Extraction (ATE)(The case study: basic medicine vocabulary); Zohreh Zolfaghar (Payame Noor University, Tehran, Iran); Tayebeh Mosavi Miangah (Payame Noor University, Tehran, Iran); Belghis Rovshan (Payame Noor University, Tehran, Iran); Amir Reza Vakilifard (Imam Khomeini University of Qazvin, Qazvin, Iran); http://jipm.irandoc.ac.ir/article-1-4057-en.html; automatic term extraction, medicine vocabulary, corpus, hybrid extraction methods, Farsi language teaching, information retrieval
collection	DOAJ
language	fas
format	Article
sources	DOAJ
author	Zohreh Zolfaghar, Tayebeh Mosavi Miangah, Belghis Rovshan, Amir Reza Vakilifard
title	A study on the improved techniques of corpus-based frequency approaches in Automatic Term Extraction (ATE)(The case study: basic medicine vocabulary)
publisher	Iranian Research Institute for Information and Technology
series	Iranian Journal of Information Processing & Management
issn	2251-8223 2251-8231
publishDate	2020-07-01
description	Corpus-based studies in linguistics, known as corpus linguistics, have grown dramatically in recent years. The current research studies improved frequency techniques for the Farsi language and was conducted to arrive at a scientific approach to automatic term extraction, focused on extracting basic medicine terms. Using statistical approaches together with corpus-linguistic tools (hybrid extraction methods) for automatic term extraction has become quite common in a number of languages, such as English, French, Japanese and Korean. So far, these approaches have not been widely used for Farsi, and most term extraction efforts have been carried out in traditional ways. Moreover, these approaches are language-specific and cannot be applied unchanged to a different language; they must be adapted to the properties of the target language to yield an extraction method appropriate for that language. To this end, the study used a group of frequency models that count frequency in a main corpus and a special corpus, together with improved versions of those models. The frequency method counted terms in a general corpus and a main corpus created by the researcher. These corpora were built from the science textbooks of Iranian high schools (grades 9-12), the science textbooks of Iranian middle schools (grades 7-8), the science texts taught at the Qazvin Imam Khomeini Farsi Language Center, and some journals and articles on general science. The results show that automatic term extraction in Farsi is potentially feasible. One of the major challenges of the simple methods is separating out high-frequency words such as coordinators and prepositions. Therefore, to increase the power of the model, we improved the basic models by applying some techniques to them. The improved frequency method performed better on the special corpus than the other methods and was able to predict up to 60% of the special vocabulary among the first 50 high-frequency extracted items. On the other hand, other results of the study show that low-frequency vocabulary in the general corpus, with a frequency similar to that of the special vocabulary, led to weaker results than the simple method.
topic	automatic term extraction, medicine vocabulary, corpus, hybrid extraction methods, Farsi language teaching, information retrieval
url	http://jipm.irandoc.ac.ir/article-1-4057-en.html
_version_	1724563002573717504
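
The abstract in this record describes counting word frequencies in a special corpus and a general corpus, filtering out high-frequency function words, and checking how many true terms appear among the top-ranked candidates. The record does not give the study's formulas, so the following is only a minimal sketch of that general idea, not the authors' method. The scoring rule (relative frequency in the special corpus divided by add-one-smoothed relative frequency in the general corpus) is a swapped-in, commonly used ratio, and all function names, the stopword list, and the gold term set are hypothetical.

```python
from collections import Counter


def rank_by_frequency(tokens, stopwords=frozenset()):
    """Baseline: rank words by raw frequency, skipping listed function words."""
    counts = Counter(t for t in tokens if t not in stopwords)
    return [word for word, _ in counts.most_common()]


def rank_by_frequency_ratio(special_tokens, general_tokens, stopwords=frozenset()):
    """Rank words of the special corpus by a frequency ratio (an assumption,
    not the formula from the study): relative frequency in the special corpus
    divided by add-one-smoothed relative frequency in the general corpus."""
    special = Counter(t for t in special_tokens if t not in stopwords)
    general = Counter(general_tokens)
    n_special = sum(special.values())
    n_general = sum(general.values())
    vocab_size = len(set(special) | set(general))

    def score(word):
        rel_special = special[word] / n_special
        # Add-one smoothing keeps the ratio finite for words unseen in the general corpus.
        rel_general = (general[word] + 1) / (n_general + vocab_size)
        return rel_special / rel_general

    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(special, key=lambda w: (-score(w), w))


def precision_at_n(ranked, gold_terms, n):
    """Share of the top-n ranked words that appear in a hand-made gold term set."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for w in top if w in gold_terms) / len(top)
```

With tokenized Farsi corpora in place of toy data, `precision_at_n(ranked, gold_terms, 50)` would correspond to the kind of figure the abstract reports for the first 50 extracted items. The ratio also hints at the weakness the abstract mentions: a word that is rare in the general corpus can score as high as a genuine term.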

Similar Items