Alternative Methods via Random Forest to Identify Interactions in a General Framework and Variable Importance in the Context of Value-Added Models



Bibliographic Details
Other Authors: Valdivia, Arturo (Author)
Format: Doctoral Thesis
Language: English
Published: 2013
Subjects:
Online Access: http://hdl.handle.net/2286/R.I.20819
id ndltd-asu.edu-item-20819
record_format oai_dc
Record Datestamp: 2018-06-22T03:04:28Z
Abstract: This work presents two complementary studies that propose heuristic methods to capture characteristics of data using the ensemble learning method of random forest. The first study is motivated by the problem in education of determining teacher effectiveness in student achievement. Value-added models (VAMs), constructed as linear mixed models, use students’ test scores as outcome variables and teachers’ contributions as random effects to ascribe changes in student performance to the teachers who have taught them. The VAMs teacher score is the empirical best linear unbiased predictor (EBLUP). This approach is limited by the adequacy of the assumed model specification with respect to the unknown underlying model. In that regard, this study proposes alternative ways to rank teacher effects that are not dependent on a given model by introducing two variable importance measures (VIMs), the node-proportion and the covariate-proportion. These VIMs are novel because they take into account the final configuration of the terminal nodes in the constitutive trees in a random forest. In a simulation study, under a variety of conditions, true rankings of teacher effects are compared with estimated rankings obtained using three sources: the newly proposed VIMs, existing VIMs, and EBLUPs from the assumed linear model specification. The newly proposed VIMs outperform all others in various scenarios where the model was misspecified. The second study develops two novel interaction measures. These measures could be used within but are not restricted to the VAM framework. The distribution-based measure is constructed to identify interactions in a general setting where a model specification is not assumed in advance. In turn, the mean-based measure is built to estimate interactions when the model specification is assumed to be linear. Both measures are unique in their construction; they take into account not only the outcome values, but also the internal structure of the trees in a random forest. In a separate simulation study, under a variety of conditions, the proposed measures are found to identify and estimate second-order interactions.
Advisor: Eubank, Randall
Committee Members: Young, Dennis; Reiser, Mark; Kao, Ming-Hung; Broatch, Jennifer
Publisher: Arizona State University
Extent: 209 pages
Degree: Ph.D. Statistics, 2013 (Doctoral Dissertation)
Rights: http://rightsstatements.org/vocab/InC/1.0/ All Rights Reserved 2013
collection NDLTD
sources NDLTD
topic Statistics
Data Mining
Interactions
Random Forest
Statistical Learning
Value Added Models
Variable Importance