Novelty Detection by Latent Semantic Indexing

As a new topic in text mining, novelty detection is a natural extension of information retrieval systems, or search engines. By refining raw search results, filtering out old news and keeping only the novel messages, it relieves readers of information overload. One of t...

Bibliographic Details
Main Author:	Zhang, Xueshan
Language:	en
Published:	2013
Subjects:	novelty detection latent semantic indexing Statistics
Online Access:	http://hdl.handle.net/10012/7560

id	ndltd-LACETR-oai-collectionscanada.gc.ca-OWTU.10012-7560
record_format	oai_dc
spelling	ndltd-LACETR-oai-collectionscanada.gc.ca-OWTU.10012-7560; 2013-10-04T04:12:25Z; Zhang, Xueshan; 2013-05-23T17:23:10Z; 2013; http://hdl.handle.net/10012/7560; en; novelty detection; latent semantic indexing; Novelty Detection by Latent Semantic Indexing; Thesis or Dissertation; Statistics and Actuarial Science; Master of Mathematics; Statistics
collection	NDLTD
language	en
sources	NDLTD
topic	novelty detection latent semantic indexing Statistics
spellingShingle	novelty detection latent semantic indexing Statistics Zhang, Xueshan Novelty Detection by Latent Semantic Indexing
description	As a new topic in text mining, novelty detection is a natural extension of information retrieval systems, or search engines. By refining raw search results, filtering out old news and keeping only the novel messages, it relieves readers of information overload. One of the difficulties in novelty detection is the inherent ambiguity of language, the carrier of information. Among the sources of ambiguity, synonymy proves to be a notable factor. To address this issue, previous studies mainly employed WordNet, a lexical database that can be regarded as a thesaurus. Rather than borrowing a dictionary, we proposed a statistical approach that employs Latent Semantic Indexing (LSI) to learn semantic relationships automatically with the help of language resources. Because LSI involves matrix factorization, an immediate problem in applying it is that the dataset in novelty detection is dynamic and constantly changing. To imitate a real-world scenario, texts are arranged in chronological order and examined one by one. Each text is compared only with those that appeared earlier, while later ones remain unknown. As a result, the data matrix starts as a one-row vector representing the first report and gains a new row at the bottom every time a new document is read. Such a changing dataset makes it hard to employ matrix methods directly. Although LSI has long been acknowledged as an effective text mining method for capturing semantic structure, neither it nor other statistical treatments had been used in novelty detection. We tried to change this situation by introducing an external text source to build the latent semantic space, onto which the incoming news vectors were projected. We used the Reuters-21578 dataset and the TREC data as sources of latent semantic information. Topics were divided by year and type in order to take the differences between them into account.
Results showed that LSI, though very effective in traditional information retrieval tasks, yielded only a slight improvement in performance for some data types. The extent of improvement depended on the similarity between the news data and the external information. An examination of the co-occurrence matrix attributed this limited performance to the unique features of microblogs: their short sentences and restricted vocabulary made it very hard to recover and exploit latent semantic information via traditional data structures.
author	Zhang, Xueshan
author_facet	Zhang, Xueshan
author_sort	Zhang, Xueshan
title	Novelty Detection by Latent Semantic Indexing
title_short	Novelty Detection by Latent Semantic Indexing
title_full	Novelty Detection by Latent Semantic Indexing
title_fullStr	Novelty Detection by Latent Semantic Indexing
title_full_unstemmed	Novelty Detection by Latent Semantic Indexing
title_sort	novelty detection by latent semantic indexing
publishDate	2013
url	http://hdl.handle.net/10012/7560
work_keys_str_mv	AT zhangxueshan noveltydetectionbylatentsemanticindexing
_version_	1716601090181431296
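
The procedure outlined in the description (build a latent semantic space from an external text source, project each incoming document onto it, and compare it only with documents seen earlier) can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the raw term counts, the cosine similarity measure, the 0.8 threshold, the toy sentences, and all function names are assumptions made for the example.

```python
import numpy as np

def build_vocab(docs):
    """Map each distinct lowercase token to a column index."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    return {w: i for i, w in enumerate(vocab)}

def to_counts(doc, vocab):
    """Raw term-count vector; out-of-vocabulary words are ignored."""
    v = np.zeros(len(vocab))
    for w in doc.lower().split():
        if w in vocab:
            v[vocab[w]] += 1.0
    return v

def fit_lsi(external_docs, k):
    """Fit a k-dimensional latent space on a fixed external corpus."""
    vocab = build_vocab(external_docs)
    X = np.array([to_counts(d, vocab) for d in external_docs])  # docs x terms
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return vocab, Vt[:k].T  # terms x k basis of right singular vectors

def project(doc, vocab, basis):
    """Project one document onto the latent space."""
    return to_counts(doc, vocab) @ basis

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def detect_novel(stream, vocab, basis, threshold=0.8):
    """Flag each document as novel if it is not too similar to any earlier one.

    The threshold is a hypothetical value chosen for illustration.
    """
    seen, flags = [], []
    for doc in stream:  # chronological order, one document at a time
        z = project(doc, vocab, basis)
        max_sim = max((cosine(z, p) for p in seen), default=0.0)
        flags.append(max_sim < threshold)
        seen.append(z)  # later documents are compared against this one
    return flags

# Toy external corpus and stream (hypothetical sentences, not thesis data).
external = ["oil prices rise", "stocks fall sharply", "oil stocks rise"]
vocab, basis = fit_lsi(external, k=3)
print(detect_novel(["oil prices rise", "oil prices rise"], vocab, basis))
```

Because the latent space is fitted once on the external corpus, the growing matrix of incoming documents never has to be refactorized; each new document only needs one projection and a comparison against the stored earlier projections.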
