Quantile regression for zero-inflated outcomes

Bibliographic Details
Main Author: Ling, Wodan
Language: English
Published: 2019
Subjects: Biometry; Quantile regression; Mathematical models; Distribution (Probability theory)
Online Access: https://doi.org/10.7916/d8-rre7-sw52
id ndltd-columbia.edu-oai-academiccommons.columbia.edu-10.7916-d8-rre7-sw52
record_format oai_dc
collection NDLTD
language English
sources NDLTD
topic Biometry
Quantile regression
Mathematical models
Distribution (Probability theory)
description Zero-inflated outcomes are common in biomedical studies, where the excess zeros indicate some special but undetectable events. Quantile regression is potentially advantageous for analyzing zero-inflated outcomes for two reasons. First, compared with parametric models such as the zero-inflated Poisson and two-part models, quantile regression gives robust and accurate estimates by avoiding likelihood specification, and it can capture tail events and heterogeneity across the outcome distribution. Second, while mean-based regression may be misinterpreted for a zero-inflated outcome, the interpretation of quantiles is naturally compatible with the underlying process that such an outcome is intended to measure. Unfortunately, uncorrected linear quantile regression is not directly applicable, also for two reasons. First, the feasibility of estimation and the validity of inference in quantile regression require the conditional distribution of the outcome to be absolutely continuous, which zero-inflation violates. Second, direct quantile regression implicitly assumes a constant chance of observing a positive outcome, but in most cases the degree of zero-inflation varies with the covariates, so the conditional quantile function of the outcome depends on the covariates in a nonlinear fashion. To analyze zero-inflated outcomes while retaining the merits of quantile regression, we propose a novel quantile regression framework that addresses all of these issues. In the first part of this dissertation, we propose a two-part model that comprises a logistic regression for the probability of being positive and a linear quantile regression for the positive part, adjusted for subject-specific zero-inflation. Inference on the estimated conditional quantile and covariate effect is not trivial under such a two-part model.
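The nonlinear dependence of the conditional quantile on the covariates can be illustrated with a minimal Python sketch. It is not the dissertation's estimation algorithm. It assumes a nonnegative outcome whose probability of being positive follows a logistic model, p(x) = expit(g0 + g1 * x), and whose positive part has a linear quantile function in x. Under those assumptions the tau-th quantile of the whole outcome is 0 when tau <= 1 - p(x), and otherwise equals the positive-part quantile at the adjusted level (tau - (1 - p(x))) / p(x). All function names and coefficient values below are hypothetical and chosen only for illustration, with x >= 0.

```python
import math


def expit(z: float) -> float:
    """Logistic function: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def positive_part_quantile(s: float, x: float) -> float:
    """Hypothetical linear quantile function of the positive part.

    Intercept and slope both vary with the quantile level s, which
    allows heterogeneous covariate effects. Values are made up.
    """
    intercept = 1.0 + 2.0 * s
    slope = 0.5 + 1.0 * s
    return intercept + slope * x


def zero_inflated_quantile(tau: float, x: float,
                           g0: float = -0.5, g1: float = 1.0) -> float:
    """tau-th conditional quantile of a zero-inflated outcome given x."""
    p_positive = expit(g0 + g1 * x)  # P(Y > 0 | x), the logistic part
    p_zero = 1.0 - p_positive        # subject-specific zero-inflation
    if tau <= p_zero:
        # The point mass at zero already covers this quantile level.
        return 0.0
    # Rescale tau to a level within the positive part's distribution.
    adjusted_level = (tau - p_zero) / p_positive
    return positive_part_quantile(adjusted_level, x)


if __name__ == "__main__":
    for x in (0.0, 1.0, 2.0):
        print(x, zero_inflated_quantile(0.5, x))
```

With these made-up coefficients the median is 0 at x = 0, because more than half of the mass sits at zero there, and becomes positive as x grows. Even though each part is linear in x, the adjusted level itself depends on x through p(x), which is the source of the nonlinearity.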
We then develop an algorithm that achieves consistent estimation of the conditional quantiles while circumventing the unbounded variance at the quantile level where the conditional quantile changes from zero to positive. Furthermore, we develop an inference tool to determine the quantile treatment effect associated with a covariate at a given quantile level. We evaluate the proposed method and compare it with existing approaches through simulation studies and a real data analysis of the risk factors for carotid atherosclerosis. In the second part, building on the two-part model above, we develop ZIQRank, a zero-inflated quantile rank-score based test for detecting differences in distributions. The proposed test extends the local inference of the first part to a simultaneous one, and it handles zero-inflation and heterogeneity simultaneously. It comprises a valid test in the logistic regression for the zero-inflation and rank-score based tests at multiple quantiles for the positive part, adjusted for zero-inflation. The p-values are combined by a procedure selected according to the extent of zero-inflation and heterogeneity in the data. Simulation studies show that, compared with existing tests, the proposed test has higher power to detect differential distributions. Finally, we apply the ZIQRank test to human scRNA-seq data to study differentially expressed genes in Neoplastic and Regular cells. It successfully discovers a group of crucial genes associated with glioma, while the other methods fail to do so. In the third part, we extend the proposed two-part quantile regression model for zero-inflated outcomes and the ZIQRank test to longitudinal data. Each part of the two-part model is modified into a marginal longitudinal model (GEE), conditioning on the outcome at the previous time point and its zero/positive status.
We apply the model and the test to study the effect of a recommender system aimed at boosting user engagement with a suite of smartphone apps designed for depressed patients. Compared with existing methods, our novel framework demonstrates dominant performance in model fitting, prediction, and critical feature detection.
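The abstract says that the ZIQRank p-values, one from the logistic part and several from the quantile rank-score tests, are combined, but it does not name the combination procedures. As a hedged illustration only, the sketch below implements one widely used rule, the Cauchy combination, which maps each p-value to a Cauchy-distributed statistic and averages them. Choosing this rule here is an assumption, and the function name and equal default weights are hypothetical rather than taken from the dissertation.

```python
import math
from typing import Optional, Sequence


def cauchy_combine(p_values: Sequence[float],
                   weights: Optional[Sequence[float]] = None) -> float:
    """Combine p-values with the Cauchy combination rule.

    Assumed rule for illustration; not confirmed as the ZIQRank procedure.
    Each p-value must lie strictly between 0 and 1.
    """
    if len(p_values) == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p < 1.0) for p in p_values):
        raise ValueError("p-values must lie strictly between 0 and 1")
    if weights is None:
        # Equal weights that sum to one.
        weights = [1.0 / len(p_values)] * len(p_values)
    if len(weights) != len(p_values):
        raise ValueError("weights and p_values must have the same length")
    # Small p-values become large positive Cauchy-scale statistics.
    statistic = sum(w * math.tan((0.5 - p) * math.pi)
                    for w, p in zip(weights, p_values))
    # Upper tail probability of a standard Cauchy at the statistic.
    return 0.5 - math.atan(statistic) / math.pi


if __name__ == "__main__":
    # One hypothetical p-value from the logistic part and three from
    # rank-score tests at different quantile levels.
    print(cauchy_combine([0.20, 0.01, 0.03, 0.04]))
```

A single small p-value dominates the combined statistic, so a rule of this kind stays sensitive when only a few quantile levels carry signal, which is the sort of heterogeneity the abstract describes.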
spelling ndltd-columbia.edu-oai-academiccommons.columbia.edu-10.7916-d8-rre7-sw52 2019-10-17T03:18:17Z Theses