Novelty Detection by Latent Semantic Indexing

As a new topic in text mining, novelty detection is a natural extension of information retrieval systems, or search engines. Aiming at refining raw search results by filtering out old news and saving only the novel messages, it saves modern people from the nightmare of information overload. One of t...

Full description

Bibliographic Details
Main Author: Zhang, Xueshan
Language:en
Published: 2013
Subjects:
Online Access:http://hdl.handle.net/10012/7560
id ndltd-LACETR-oai-collectionscanada.gc.ca-OWTU.10012-7560
record_format oai_dc
spelling ndltd-LACETR-oai-collectionscanada.gc.ca-OWTU.10012-75602013-10-04T04:12:25ZZhang, Xueshan2013-05-23T17:23:10Z2013-05-23T17:23:10Z2013-05-23T17:23:10Z2013http://hdl.handle.net/10012/7560As a new topic in text mining, novelty detection is a natural extension of information retrieval systems, or search engines. Aiming at refining raw search results by filtering out old news and saving only the novel messages, it saves modern people from the nightmare of information overload. One of the difficulties in novelty detection is the inherent ambiguity of language, which is the carrier of information. Among the sources of ambiguity, synonymy proves to be a notable factor. To address this issue, previous studies mainly employed WordNet, a lexical database which can be perceived as a thesaurus. Rather than borrowing a dictionary, we proposed a statistical approach employing Latent Semantic Indexing (LSI) to learn semantic relationship automatically with the help of language resources. To apply LSI which involves matrix factorization, an immediate problem is that the dataset in novelty detection is dynamic and changing constantly. As an imitation of real-world scenario, texts are ranked in chronological order and examined one by one. Each text is only compared with those having appeared earlier, while later ones remain unknown. As a result, the data matrix starts as a one-row vector representing the first report, and has a new row added at the bottom every time we read a new document. Such a changing dataset makes it hard to employ matrix methods directly. Although LSI has long been acknowledged as an effective text mining method when considering semantic structure, it has never been used in novelty detection, nor have other statistical treatments. We tried to change this situation by introducing external text source to build the latent semantic space, onto which the incoming news vectors were projected. We used the Reuters-21578 dataset and the TREC data as sources of latent semantic information. Topics were divided into years and types in order to take the differences between them into account. Results showed that LSI, though very effective in traditional information retrieval tasks, had only a slight improvement to the performances for some data types. The extent of improvement depended on the similarity between news data and external information. A probing into the co-occurrence matrix attributed such a limited performance to the unique features of microblogs. Their short sentence lengths and restricted dictionary made it very hard to recover and exploit latent semantic information via traditional data structure.ennovelty detectionlatent semantic indexingNovelty Detection by Latent Semantic IndexingThesis or DissertationStatistics and Actuarial ScienceMaster of MathematicsStatistics
collection NDLTD
language en
sources NDLTD
topic novelty detection
latent semantic indexing
Statistics
spellingShingle novelty detection
latent semantic indexing
Statistics
Zhang, Xueshan
Novelty Detection by Latent Semantic Indexing
description As a new topic in text mining, novelty detection is a natural extension of information retrieval systems, or search engines. Aiming at refining raw search results by filtering out old news and saving only the novel messages, it saves modern people from the nightmare of information overload. One of the difficulties in novelty detection is the inherent ambiguity of language, which is the carrier of information. Among the sources of ambiguity, synonymy proves to be a notable factor. To address this issue, previous studies mainly employed WordNet, a lexical database which can be perceived as a thesaurus. Rather than borrowing a dictionary, we proposed a statistical approach employing Latent Semantic Indexing (LSI) to learn semantic relationship automatically with the help of language resources. To apply LSI which involves matrix factorization, an immediate problem is that the dataset in novelty detection is dynamic and changing constantly. As an imitation of real-world scenario, texts are ranked in chronological order and examined one by one. Each text is only compared with those having appeared earlier, while later ones remain unknown. As a result, the data matrix starts as a one-row vector representing the first report, and has a new row added at the bottom every time we read a new document. Such a changing dataset makes it hard to employ matrix methods directly. Although LSI has long been acknowledged as an effective text mining method when considering semantic structure, it has never been used in novelty detection, nor have other statistical treatments. We tried to change this situation by introducing external text source to build the latent semantic space, onto which the incoming news vectors were projected. We used the Reuters-21578 dataset and the TREC data as sources of latent semantic information. Topics were divided into years and types in order to take the differences between them into account. Results showed that LSI, though very effective in traditional information retrieval tasks, had only a slight improvement to the performances for some data types. The extent of improvement depended on the similarity between news data and external information. A probing into the co-occurrence matrix attributed such a limited performance to the unique features of microblogs. Their short sentence lengths and restricted dictionary made it very hard to recover and exploit latent semantic information via traditional data structure.
author Zhang, Xueshan
author_facet Zhang, Xueshan
author_sort Zhang, Xueshan
title Novelty Detection by Latent Semantic Indexing
title_short Novelty Detection by Latent Semantic Indexing
title_full Novelty Detection by Latent Semantic Indexing
title_fullStr Novelty Detection by Latent Semantic Indexing
title_full_unstemmed Novelty Detection by Latent Semantic Indexing
title_sort novelty detection by latent semantic indexing
publishDate 2013
url http://hdl.handle.net/10012/7560
work_keys_str_mv AT zhangxueshan noveltydetectionbylatentsemanticindexing
_version_ 1716601090181431296