Parameter free document stream classification.

Extensive experiments were conducted to evaluate the effectiveness of PFreeBT and PNLH using a two-year stream of news stories and three benchmarks. The results showed that the patterns of the bursty features and bursty topics identified by PFreeBT match our expectations, whereas PNLH...


Bibliographic Details
Other Authors: Fung, Pui Cheong Gabriel.
Format: Others
Language: English, Chinese
Published: 2006
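Subjects: Classification; Data mining--Mathematical models; Database management; Information storage and retrieval systems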
Online Access: http://library.cuhk.edu.hk/record=b6074286
http://repository.lib.cuhk.edu.hk/en/item/cuhk-343915
id ndltd-cuhk.edu.hk-oai-cuhk-dr-cuhk_343915
record_format oai_dc
collection NDLTD
language English
Chinese
format Others
sources NDLTD
topic Classification
Data mining--Mathematical models
Database management
Information storage and retrieval systems
description Extensive experiments were conducted to evaluate the effectiveness of PFreeBT and PNLH using a two-year stream of news stories and three benchmarks. The results showed that the patterns of the bursty features and bursty topics identified by PFreeBT match our expectations, whereas PNLH demonstrates significant improvements over all of the existing heuristics. These favorable results indicate that both PFreeBT and PNLH are highly effective and feasible. === For the problem of bursty topic identification, PFreeBT adopts what we term a feature-pivot clustering approach. Given a document stream, PFreeBT first identifies a set of bursty features in it. The identification process is based on computing probability distributions. From the patterns of the bursty features and two newly defined concepts (equivalent and map-to), a set of bursty topics can then be extracted. === For the problem of constructing a reliable classifier, we formulate it as a partially supervised classification problem. In this problem, only a few training examples are labeled as positive (P). All other training examples (U) remain unlabeled. Here, U is a mixture of negative examples (N) and some other positive examples (P'). Existing techniques that tackle this problem all focus on finding N from U; none of them attempts to extract P' from U. Doing so is difficult because the topics in U are diverse and its features are sparse. In this dissertation, PNLH is proposed for extracting high-quality P' and N from U. === In this dissertation, two heuristics, PFreeBT and PNLH, are proposed to tackle the aforementioned problems. PFreeBT aims at identifying the bursty topics in a document stream, whereas PNLH aims at constructing a reliable classifier for a given bursty topic. It is worth noting that both heuristics are parameter free: users do not need to provide any parameter explicitly, and all of the required variables can be computed automatically from the given document stream. === In this century of information overload, information becomes ever more pervasive. A new class of data-intensive applications has arisen in which data is best modeled as an open-ended stream. We call such data a data stream. A document stream is a variation of a data stream that consists of a sequence of chronologically ordered documents. A fundamental problem in mining document streams is to extract meaningful structure from them, so as to help us organize their contents systematically. This dissertation focuses on that problem. Specifically, it studies two problems: identifying the bursty topics in a document stream and constructing classifiers for the bursty topics. A bursty topic is a topic in the document stream to which a large number of documents are related during a bounded time interval. === Fung Pui Cheong Gabriel. === "August 2006." === Adviser: Jeffrey Xu Yu. === Source: Dissertation Abstracts International, Volume: 68-03, Section: B, page: 1720. === Thesis (Ph.D.)--Chinese University of Hong Kong, 2006. === Includes bibliographical references (p. 122-130). === Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. === Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. === Abstracts in English and Chinese. === School code: 1307.
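The description says that bursty features are identified by computing probability distributions, but it does not name the distribution or the decision rule. As a rough illustration only, the sketch below assumes a binomial model: if a feature appears in a fraction p of all documents overall, then the number of documents containing it in a window of n documents is treated as Binomial(n, p), and a window is flagged when the observed count is very unlikely under that model. The function names `binomial_tail` and `bursty_windows` and the cutoff `alpha` are hypothetical and are not taken from the dissertation; in particular, a fixed cutoff is exactly the kind of user-supplied parameter that PFreeBT is described as avoiding.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def bursty_windows(doc_counts, feature_counts, alpha=0.01):
    """Return indices of time windows in which a feature looks bursty.

    doc_counts[t]     -- number of documents in window t
    feature_counts[t] -- number of those documents containing the feature
    alpha             -- illustrative cutoff (an assumption of this sketch)
    """
    total_docs = sum(doc_counts)
    if total_docs == 0:
        return []
    # Overall rate at which the feature occurs across the whole stream.
    p = sum(feature_counts) / total_docs
    return [
        t
        for t, (n, k) in enumerate(zip(doc_counts, feature_counts))
        if k > 0 and binomial_tail(n, k, p) < alpha
    ]


# Six windows of 100 documents each; the feature spikes in window 3.
print(bursty_windows([100] * 6, [2, 1, 3, 40, 2, 1]))
```

In this toy stream only the window with 40 occurrences is flagged; the other windows sit at or below the feature's overall rate.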
author2 Fung, Pui Cheong Gabriel.
title Parameter free document stream classification.
publishDate 2006
url http://library.cuhk.edu.hk/record=b6074286
http://repository.lib.cuhk.edu.hk/en/item/cuhk-343915
spelling ndltd-cuhk.edu.hk-oai-cuhk-dr-cuhk_343915 2019-02-19T03:43:54Z Parameter free document stream classification. CUHK electronic theses & dissertations collection Fung, Pui Cheong Gabriel. Chinese University of Hong Kong Graduate School. Division of Systems Engineering and Engineering Management. 2006 Text theses electronic resource microform microfiche 1 online resource (xvi, 130 p. : ill.) cuhk:343915 http://library.cuhk.edu.hk/record=b6074286 eng chi Use of this resource is governed by the terms and conditions of the Creative Commons “Attribution-NonCommercial-NoDerivatives 4.0 International” License (http://creativecommons.org/licenses/by-nc-nd/4.0/) http://repository.lib.cuhk.edu.hk/en/islandora/object/cuhk%3A343915/datastream/TN/view/Parameter%20free%20document%20stream%20classification.jpg http://repository.lib.cuhk.edu.hk/en/item/cuhk-343915