Mimicking Complexity of Structured Data Matrix’s Information Content: Categorical Exploratory Data Analysis

We develop Categorical Exploratory Data Analysis (CEDA) with mimicking to explore and exhibit the complexity of information content that is contained within any data matrix: categorical, discrete, or continuous. Such complexity is shown through visible and explainable serial multiscale structural de...

Bibliographic Details
Main Authors: Fushing Hsieh, Elizabeth P. Chou, Ting-Li Chen
Format: Article
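Language: English
Published: MDPI AG 2021-05-01
Series: Entropy
Subjects: contingency-kD-lattice; high order structural dependency; heterogeneity; mutual conditional entropy matrix; principal component analysis (PCA)
Online Access: https://www.mdpi.com/1099-4300/23/5/594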
id doaj-4d98c96f09f84b40ad6859411f78e586
record_format Article
spelling doaj-4d98c96f09f84b40ad6859411f78e586
2021-05-31T23:42:26Z
eng
MDPI AG
Entropy
1099-4300
2021-05-01
23
5
594
594
10.3390/e23050594
Mimicking Complexity of Structured Data Matrix’s Information Content: Categorical Exploratory Data Analysis
Fushing Hsieh (0)
Elizabeth P. Chou (1)
Ting-Li Chen (2)
(0) Department of Statistics, University of California at Davis, Davis, CA 95616, USA
(1) Department of Statistics, National Chengchi University, Taibei 116, Taiwan
(2) Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan
https://www.mdpi.com/1099-4300/23/5/594
contingency-kD-lattice
high order structural dependency
heterogeneity
mutual conditional entropy matrix
principal component analysis (PCA)
collection DOAJ
language English
format Article
sources DOAJ
author Fushing Hsieh
Elizabeth P. Chou
Ting-Li Chen
title Mimicking Complexity of Structured Data Matrix’s Information Content: Categorical Exploratory Data Analysis
publisher MDPI AG
series Entropy
issn 1099-4300
publishDate 2021-05-01
description We develop Categorical Exploratory Data Analysis (CEDA) with mimicking to explore and exhibit the complexity of information content that is contained within any data matrix: categorical, discrete, or continuous. Such complexity is shown through visible and explainable serial multiscale structural dependency with heterogeneity. CEDA is developed upon all features’ categorical nature via histograms, and it is guided by all features’ associative patterns (order-2 dependence) in a mutual conditional entropy matrix. Higher-order structural dependency of k (≥ 3) features is exhibited through block patterns within heatmaps that are constructed by permuting contingency-kD-lattices of counts. By growing k, the resultant heatmap series contains global and large scales of structural dependency that constitute the data matrix’s information content. When continuous features are involved, principal component analysis (PCA) extracts fine-scale information content from each block in the final heatmap. Our mimicking protocol coherently simulates this heatmap series by preserving global-to-fine-scale structural dependency. At every step of the mimicking process, each accepted simulated heatmap is subject to constraints with respect to all of the reliable observed categorical patterns. For reliability and robustness in sciences, CEDA with mimicking enhances data visualization by revealing deterministic and stochastic structures within each scale-specific structural dependency. For inferences in Machine Learning (ML) and Statistics, it clarifies on which scales which covariate feature groups have major versus minor predictive power on response features. For the social justice of Artificial Intelligence (AI) products, it checks whether a data matrix incompletely prescribes the targeted system.
topic contingency-kD-lattice
high order structural dependency
heterogeneity
mutual conditional entropy matrix
principal component analysis (PCA)
url https://www.mdpi.com/1099-4300/23/5/594