Clustering multivariate longitudinal observations: The contaminated Gaussian hidden Markov model

Bibliographic Details
Main Authors: Punzo, Antonio (Author), Maruotti, Antonello (Author)
Format: Article
Language: English
Published: 2015-09-29.
Online Access: Get fulltext
LEADER 02185 am a22001333u 4500
001 383292
042 |a dc 
100 1 0 |a Punzo, Antonio  |e author 
700 1 0 |a Maruotti, Antonello  |e author 
245 0 0 |a Clustering multivariate longitudinal observations: The contaminated Gaussian hidden Markov model 
260 |c 2015-09-29. 
856 |z Get fulltext  |u https://eprints.soton.ac.uk/383292/1/__userfiles.soton.ac.uk_Library_SLAs_Work_for_ALL%2527s_Work_for_ePrints_Accepted%2520Manuscripts_Punzo_Clustering.pdf 
520 |a The Gaussian hidden Markov model (HMM) is widely considered for the analysis of heterogeneous continuous multivariate longitudinal data. To robustify this approach with respect to possible elliptical heavy-tailed departures from normality, due to the presence of outliers, spurious points, or noise (collectively referred to as bad points herein), the contaminated Gaussian HMM is here introduced. The contaminated Gaussian distribution represents an elliptical generalization of the Gaussian distribution and allows for automatic detection of bad points in the same natural way as observations are typically assigned to the latent states in the HMM context. Once the model is fitted, each observation has a posterior probability of belonging to a particular state and, inside each state, of being a bad point or not. In addition to the parameters of the classical Gaussian HMM, for each state we have two more parameters, both with a specific and useful interpretation: one controls the proportion of bad points and one specifies their degree of atypicality. A sufficient condition for the identifiability of the model is given, an expectation-conditional maximization algorithm is outlined for parameter estimation and various operational issues are discussed. Using a large scale simulation study, but also an illustrative artificial dataset, we demonstrate the effectiveness of the proposed model in comparison with HMMs of different elliptical distributions, and we also evaluate the performance of some well-known information criteria in selecting the true number of latent states. The model is finally used to fit data on criminal activities in Italian provinces. 
655 7 |a Article
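
The abstract in field 520 describes a contaminated Gaussian distribution with two extra parameters per state, one for the proportion of bad points and one for their degree of atypicality. As a rough illustration only, the sketch below uses the usual two-component form of a contaminated Gaussian in the univariate case: a "good" Gaussian mixed with a second Gaussian that shares the mean but has inflated variance. The names `alpha` (taken here as the weight of good points, so the proportion of bad points is `1 - alpha`) and `eta` (the variance inflation factor) are assumptions made for this example and are not given in the record. The paper's model is multivariate and embedded in an HMM, which this sketch does not attempt to reproduce.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def contaminated_pdf(x, mean, var, alpha, eta):
    """Contaminated Gaussian density (univariate simplification).

    alpha: assumed weight of the 'good' component (0 < alpha < 1)
    eta:   assumed variance inflation of the 'bad' component (eta > 1)
    """
    if not (0.0 < alpha < 1.0) or eta <= 1.0:
        raise ValueError("need 0 < alpha < 1 and eta > 1")
    good = alpha * normal_pdf(x, mean, var)
    bad = (1.0 - alpha) * normal_pdf(x, mean, eta * var)
    return good + bad


def prob_bad(x, mean, var, alpha, eta):
    """Posterior probability that x is a bad point, conditional on its state."""
    bad = (1.0 - alpha) * normal_pdf(x, mean, eta * var)
    return bad / contaminated_pdf(x, mean, var, alpha, eta)


if __name__ == "__main__":
    # Hypothetical state: mean 0, variance 1, 10% bad points, inflation 10.
    for x in (0.0, 2.0, 6.0):
        print(x, round(prob_bad(x, 0.0, 1.0, alpha=0.9, eta=10.0), 4))
```

Under these assumed settings, a point near the state mean gets a small probability of being bad, while a point far in the tail gets a probability close to one. This mirrors the abstract's description of bad points being detected in the same way observations are assigned to latent states.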