Differential Message Importance Measure: A New Approach to the Required Sampling Number in Big Data Structure Characterization

Sample size is a fundamental problem in statistics and plays an important role in data collection for big data scenarios, especially in characterizing data structure. This paper considers the problem from the perspective of message importance by transforming the sampling pro...

Bibliographic Details
Main Authors: Shanyun Liu, Rui She, Pingyi Fan
Format: Article
Language: English
Published: IEEE 2018-01-01
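Series: IEEE Access
Subjects: Differential message importance measure; big data; Kolmogorov-Smirnov test; goodness-of-fit; distribution-free
Online Access: https://ieeexplore.ieee.org/document/8419705/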
id doaj-bc00c06a6cfd450691ced37b938128a2
record_format Article
spelling doaj-bc00c06a6cfd450691ced37b938128a2
2021-03-29T21:06:47Z
eng
IEEE
IEEE Access
2169-3536
2018-01-01
vol. 6, pp. 42851-42867
10.1109/ACCESS.2018.2859398
8419705
Differential Message Importance Measure: A New Approach to the Required Sampling Number in Big Data Structure Characterization
Shanyun Liu
Rui She
Pingyi Fan
https://orcid.org/0000-0002-0658-6079
Department of Electronic Engineering, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing, China (all three authors)
collection DOAJ
language English
format Article
sources DOAJ
author Shanyun Liu
Rui She
Pingyi Fan
title Differential Message Importance Measure: A New Approach to the Required Sampling Number in Big Data Structure Characterization
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2018-01-01
description Sample size is a fundamental problem in statistics and plays an important role in data collection for big data scenarios, especially in characterizing data structure. This paper considers the problem from the perspective of message importance by transforming the sampling procedure into a process of collecting message importance. To this end, we define the differential message importance measure (DMIM), a measure of message importance for continuous random variables analogous to differential entropy, and calculate the DMIM for some common distributions. Based on the DMIM, the paper proposes a new approach to the required sampling number, in which the DMIM deviation is constructed to characterize the process of collecting message importance. The DMIM deviation serves as a new criterion for choosing a sample size large enough that the message importance of the sample set differs from the total message importance by no more than a specified amount. To show visually that the DMIM deviation can guarantee statistical performance to some extent, we transform the difference in message importance into the Kolmogorov-Smirnov statistic. Theoretical analyses and numerical results also demonstrate that the new approach is distribution-free and satisfies the Glivenko-Cantelli theorem, in agreement with previous results in statistics. Moreover, a connection between message importance and distribution goodness-of-fit is established, which verifies that it is feasible to analyze data collection while taking message importance into account.
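The description ties the required sampling number to the Kolmogorov-Smirnov statistic and to distribution-free behavior. The sketch below illustrates only that classical background and is not the paper's DMIM deviation criterion: it computes the Kolmogorov-Smirnov statistic of a sample against a reference CDF and derives a distribution-free sample size from the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, which is swapped in here as a stand-in. The names `ks_statistic`, `dkw_sample_size`, and `normal_cdf`, and the values of `eps` and `alpha`, are hypothetical choices made for illustration.

```python
import math
import random


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF of `samples` and `cdf`."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def dkw_sample_size(eps, alpha):
    """Smallest n with 2*exp(-2*n*eps**2) <= alpha (DKW inequality).

    Assumption: the DKW bound is used as a classical stand-in; it is not
    the DMIM deviation criterion described in the record.
    """
    return math.ceil(math.log(2.0 / alpha) / (2.0 * eps ** 2))


def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    n = dkw_sample_size(eps=0.05, alpha=0.05)
    samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
    print(n, ks_statistic(samples, normal_cdf))
```

The sample size returned by `dkw_sample_size` depends only on the tolerance `eps` and the failure probability `alpha`, not on the underlying distribution, which is the sense of "distribution-free" in this sketch. The shrinking of the statistic as `n` grows is the behavior the Glivenko-Cantelli theorem guarantees.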
topic Differential message importance measure
big data
Kolmogorov-Smirnov test
goodness-of-fit
distribution-free
url https://ieeexplore.ieee.org/document/8419705/