Automatically detecting authors' native language

Bibliographic Details
Main Author: Ahn, Charles S.
Other Authors: Martell, Craig H.
Published: Monterey, California. Naval Postgraduate School 2012
Online Access: http://hdl.handle.net/10945/5821
id ndltd-nps.edu-oai-calhoun.nps.edu-10945-5821
record_format oai_dc
spelling ndltd-nps.edu-oai-calhoun.nps.edu-10945-5821
2015-05-06T03:57:51Z
Automatically detecting authors' native language
Ahn, Charles S.
Martell, Craig H.
Anand, Pranav
Naval Postgraduate School (U.S.)
Computer Science
Approved for public release; distribution is unlimited
2012-03-14T17:46:50Z
2012-03-14T17:46:50Z
2011-03
Thesis
http://hdl.handle.net/10945/5821
720330748
This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. As such, it is in the public domain, and under the provisions of Title 17, United States Code, Section 105, it may not be copyrighted.
Monterey, California. Naval Postgraduate School
collection NDLTD
sources NDLTD
description Approved for public release; distribution is unlimited. When non-native speakers learn English, their first language influences how they learn. This is known as L1-L2 language transfer, and linguistic studies have shown that these language transfers can affect writing as well. If there were a model that exploits L1-L2 language transfer to identify the authors' native language, it would be an invaluable tool for the intelligence community as well as in the field of education. Therefore, the objective of this research is to find out if it is possible to automatically detect the author's native language based on his/her writing in English using traditional machine learning techniques. For this research, we used eight different collections of writings by speakers of eight different nationalities: native English speakers as well as speakers of Bulgarian, Chinese, Czech, French, Japanese, Russian, and Spanish. Among the various feature sets used in this research, character trigrams and bag of words alone achieved higher than 80% accuracy, and the empirical analysis of character trigrams revealed that the character trigrams just model lexical usage. When content words were extracted, the performance dropped and the results revealed that the topic words were doing all the work.
author2 Martell, Craig H.
author_facet Martell, Craig H.
Ahn, Charles S.
author Ahn, Charles S.
spellingShingle Ahn, Charles S.
Automatically detecting authors' native language
author_sort Ahn, Charles S.
title Automatically detecting authors' native language
title_sort automatically detecting authors' native language
publisher Monterey, California. Naval Postgraduate School
publishDate 2012
url http://hdl.handle.net/10945/5821
work_keys_str_mv AT ahncharless automaticallydetectingauthorsnativelanguage
_version_ 1716802847636455424