Corpus and sentiment analysis

Information extraction/retrieval has been of interest to researchers since the early 1960s. A series of conferences and competitions held by DARPA/NIST since the late 1980s has resulted in the analysis of news reports and government reports in English and other languages, notab...

Bibliographic Details
Main Author: Cheng, Tai Wai David
Published: University of Surrey 2007
Subjects:
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.442666
id ndltd-bl.uk-oai-ethos.bl.uk-442666
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-442666; 2015-08-04T03:33:15Z; Corpus and sentiment analysis; Cheng, Tai Wai David; 2007; 006.35; University of Surrey; http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.442666; http://epubs.surrey.ac.uk/2744/; Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
topic 006.35
description Information extraction/retrieval has been of interest to researchers since the early 1960s. A series of conferences and competitions held by DARPA/NIST since the late 1980s has resulted in the analysis of news reports and government reports in English and other languages, notably Chinese and Arabic. A number of methods have been developed for analysing 'free' natural language texts. Furthermore, a number of systems for understanding messages have been developed, focusing on named entity extraction and on templates for dealing with certain kinds of news. The templates were handcrafted, and a lot of ad-hoc knowledge went into the creation of such systems. Seven of these systems have been reviewed. Although information extraction (IE) systems built for different tasks often differ from each other, nearly every extraction system shares the same core elements. Some of these core elements, such as the parser and the part-of-speech (POS) tagger, are tuned for optimal performance on a specific domain or on text with pre-defined structures. The extensive use of gazetteers and manually crafted grammar rules further limits the portability of existing IE systems across languages and domains. The goal of this thesis is to develop an algorithm that can be used to extract information unambiguously from free texts, in our case financial news texts, and from arbitrary domains. We believe that corpus linguistics and statistical techniques are more appropriate and efficient for this task than approaches that rely on machine learning, POS taggers, parsers, and so on, which are tuned to work for a predefined domain. Based on this belief, a framework was developed that uses corpus linguistics and statistical techniques to extract information as unambiguously as possible from arbitrary domains.
A contrastive evaluation has been carried out not only in the domains of financial texts and movie reviews, but also with multilingual texts (Chinese and English). The results are encouraging. Our preliminary evaluation was based on the correlation between a time series of positive (negative) sentiment word (phrase) counts and a time series of indices produced by stock exchanges (Financial Times Stock Exchange, Dow Jones Industrial Average, Nasdaq, S&P 500, Hang Seng Index, Shanghai Index, and Shenzhen Index). It showed that when the positive (negative) sentiment series correlates with the stock exchange index, the negative (positive) series shows a smaller degree of correlation and, in many cases, a degree of anti-correlation. Any interpretation of our result requires a careful, econometrically well-grounded analysis of the financial time series; this is beyond the scope of this work.