Automatically detecting authors' native language
Approved for public release; distribution is unlimited. When non-native speakers learn English, their first language influences how they learn. This is known as L1-L2 language transfer, and linguistic studies have shown that these language transfers can affect writing as well. If there were a mo...
Main Author: | Ahn, Charles S. |
---|---|
Other Authors: | Martell, Craig H. |
Published: | Monterey, California. Naval Postgraduate School, 2012 |
Online Access: | http://hdl.handle.net/10945/5821 |
id |
ndltd-nps.edu-oai-calhoun.nps.edu-10945-5821 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-nps.edu-oai-calhoun.nps.edu-10945-5821
2015-05-06T03:57:51Z
Automatically detecting authors' native language
Ahn, Charles S.
Martell, Craig H.
Anand, Pranav
Naval Postgraduate School (U.S.)
Computer Science
Approved for public release; distribution is unlimited
2012-03-14T17:46:50Z
2012-03-14T17:46:50Z
2011-03
Thesis
http://hdl.handle.net/10945/5821
720330748
This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. As such, it is in the public domain, and under the provisions of Title 17, United States Code, Section 105, it may not be copyrighted.
Monterey, California. Naval Postgraduate School |
collection |
NDLTD |
sources |
NDLTD |
description |
Approved for public release; distribution is unlimited. When non-native speakers learn English, their first language influences how they learn. This is known as L1-L2 language transfer, and linguistic studies have shown that these language transfers can affect writing as well. If there were a model that exploits L1-L2 language transfer to identify the authors' native language, it would be an invaluable tool for the intelligence community as well as in the field of education. Therefore, the objective of this research is to find out if it is possible to automatically detect the author's native language based on his/her writing in English using traditional machine learning techniques. For this research, we used eight different collections of writings by speakers of eight different nationalities: native English speakers as well as speakers of Bulgarian, Chinese, Czech, French, Japanese, Russian, and Spanish. Among the various feature sets used in this research, character trigrams and bag of words alone achieved higher than 80% accuracy, and the empirical analysis of character trigrams revealed that the character trigrams just model lexical usage. When content words were extracted, the performance dropped and the results revealed that the topic words were doing all the work. |
author2 |
Martell, Craig H. |
author |
Ahn, Charles S. |
title |
Automatically detecting authors' native language |
publisher |
Monterey, California. Naval Postgraduate School |
publishDate |
2012 |
url |
http://hdl.handle.net/10945/5821 |