Granular computing for imbalanced data: theory and applications

Doctoral (博士) === National Chiao Tung University (國立交通大學) === Department of Industrial Engineering and Management (工業工程與管理系所) === 94 === In recent years, machine learning techniques have provided an effective avenue for classification problems. However, when learning from imbalanced data, traditional methods identify minority instances poorly. This...

Bibliographic Details
Main Authors: Chen, Long-Sheng, 陳隆昇
Other Authors: Su, Chao-Ton
Format: Others
Language: en_US
Published: 2006
Online Access: http://ndltd.ncl.edu.tw/handle/83930875471713976916
id ndltd-TW-094NCTU5031018
record_format oai_dc
spelling ndltd-TW-094NCTU5031018 2016-06-03T04:14:19Z http://ndltd.ncl.edu.tw/handle/83930875471713976916 Granular computing for imbalanced data: theory and applications 粒化計算處理不平衡資料之理論與應用 Chen, Long-Sheng 陳隆昇 博士 國立交通大學 工業工程與管理系所 94 Su, Chao-Ton Li, Rong-Kwei 蘇朝墩 李榮貴 2006 學位論文 ; thesis 89 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Doctoral (博士) === National Chiao Tung University (國立交通大學) === Department of Industrial Engineering and Management (工業工程與管理系所) === 94 === In recent years, machine learning techniques have provided an effective avenue for classification problems. However, when learning from imbalanced data, traditional methods identify minority instances poorly. This problem is of crucial importance because it arises in many domains of great environmental, vital, or commercial importance, such as fraud detection, text mining, spam detection, medical diagnosis, and fault monitoring/inspection. In this study, we propose novel “Granular Computing” (GrC) models to tackle class imbalance problems. Granular computing, which is oriented towards representing and processing Information Granules (IGs), is a computing paradigm that embraces a number of modeling frameworks. GrC imitates the human instinct for processing information and is becoming an important topic in computer science, logic, philosophy, and other fields. When describing a problem that involves incomplete, uncertain, or vague information, human beings tend to shy away from numbers and reason with aggregates instead. This leads us to consider IGs, which are collections of entities grouped together by their similarity, functional adjacency, and indistinguishability. A GrC model can not only remove unnecessary details and provide better insight into the essence of the data, but also effectively solve class imbalance problems.

This study develops two kinds of GrC models, the “Knowledge Acquisition via Information Granulation” (KAIG) model and the “Information Granules based method” (IG based method), for dealing with discrete and continuous data, respectively. In both models, the homogeneity index (H-index) and the undistinguishable ratio (U-ratio) are introduced to determine a suitable level of granularity (i.e., a suitable number of IGs). A Fuzzy Adaptive Resonance Theory (Fuzzy ART) neural network is used to construct the IGs. In addition, we propose the concept of “sub-attributes” to describe granules and to handle the overlap among granules in the KAIG model. In the IG based method, data characteristics are employed to represent IGs.

The main objectives of this study are:

1. Develop a KAIG model to construct IGs and to discover knowledge from them. Seven data sets from the UCI data bank (including one imbalanced diagnosis data set) are used to evaluate the effectiveness of the KAIG model. Measured by three performance indexes (Overall Accuracy, G-mean, and the ROC curve), the experimental results show that our method outperforms C4.5 and the Support Vector Machine (SVM).

2. Apply the KAIG model to class imbalance problems in areas related to industrial engineering. First, the KAIG model is used to improve the classification performance of a dynamic scheduling system within a simulated Flexible Manufacturing System environment. Second, a real case of cellular phone inspection illustrates the ability of the KAIG model to identify rare defective products. In addition, the KAIG model can reduce redundant test items and shorten inspection time. For imbalanced data, these applications show that the KAIG model can dramatically increase Negative Accuracy (the capability of detecting minority instances) without losing Overall Accuracy.

3. Propose the IG based method to deal with continuous imbalanced data. In this method, different data characteristics and their combinations are employed to represent the constructed IGs, and a classifier is then built from these representatives. A real diabetes diagnosis data set is used to evaluate the effectiveness of this method. Compared with traditional techniques, the proposed method is shown to be superior for learning on imbalanced data.
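The performance indexes named in the abstract can be illustrated with a small sketch. The record does not give the thesis's formulas, so the code below assumes the commonly used definitions: Overall Accuracy is the fraction of all instances classified correctly, per-class accuracy is the fraction of one class classified correctly (the abstract's Negative Accuracy corresponds to the minority class), and G-mean is the geometric mean of the two per-class accuracies. The function names, the labels "ok" and "defect", and the toy data are hypothetical and are not taken from the thesis.

```python
import math


def class_accuracy(y_true, y_pred, label):
    """Fraction of instances whose true class is `label` that were predicted as `label`."""
    total = sum(1 for t in y_true if t == label)
    if total == 0:
        raise ValueError(f"no instances of class {label!r} in y_true")
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return hits / total


def imbalance_report(y_true, y_pred, minority_label, majority_label):
    """Overall accuracy, per-class accuracies, and G-mean for a two-class problem.

    Assumption: G-mean = sqrt(minority accuracy * majority accuracy), the common
    definition; the catalog record does not state the thesis's exact formula.
    """
    overall = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    minority_acc = class_accuracy(y_true, y_pred, minority_label)
    majority_acc = class_accuracy(y_true, y_pred, majority_label)
    return {
        "overall_accuracy": overall,
        "minority_accuracy": minority_acc,
        "majority_accuracy": majority_acc,
        "g_mean": math.sqrt(minority_acc * majority_acc),
    }


# Hypothetical toy data: 95 majority instances ("ok") and 5 minority ("defect").
y_true = ["ok"] * 95 + ["defect"] * 5

# A classifier that always predicts the majority class.
always_ok = ["ok"] * 100
baseline = imbalance_report(y_true, always_ok, "defect", "ok")

# A classifier that misses 1 of 5 defects and raises 5 false alarms.
mixed = ["ok"] * 90 + ["defect"] * 5 + ["defect"] * 4 + ["ok"] * 1
improved = imbalance_report(y_true, mixed, "defect", "ok")

print(baseline)
print(improved)
```

On this toy data the always-majority classifier has the higher Overall Accuracy but a minority accuracy and G-mean of zero, which is why Overall Accuracy alone can be misleading on imbalanced data.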
author2 Su, Chao-Ton
author_facet Su, Chao-Ton
Chen, Long-Sheng
陳隆昇
author Chen, Long-Sheng
陳隆昇
title Granular computing for imbalanced data: theory and applications
publishDate 2006
url http://ndltd.ncl.edu.tw/handle/83930875471713976916