A generalized hierarchical approach for data labeling

This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. === Thesis: S.M., Massachusetts Institute of Technology, Sloan School of Management, Operations Research Center, 2019 === Cataloged from PDF version of th...


Bibliographic Details
Main Author:	Blanks, Zachary D.
Other Authors:	Troy M. Lau and Rahul Mazumder.
Format:	Others
Language:	English
Published:	Massachusetts Institute of Technology 2019
Subjects:	Operations Research Center.
Online Access:	https://hdl.handle.net/1721.1/122386

id	ndltd-MIT-oai-dspace.mit.edu-1721.1-122386
record_format	oai_dc
spelling	ndltd-MIT-oai-dspace.mit.edu-1721.1-122386 2019-10-06T03:11:31Z A generalized hierarchical approach for data labeling Blanks, Zachary D. Troy M. Lau and Rahul Mazumder. Massachusetts Institute of Technology. Operations Research Center. by Zachary D. Blanks. S.M. Massachusetts Institute of Technology, Sloan School of Management, Operations Research Center 2019-10-04T21:31:31Z 2019 Thesis https://hdl.handle.net/1721.1/122386 1120104764 eng MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source but further reproduction or distribution in any format is prohibited without written permission. http://dspace.mit.edu/handle/1721.1/7582 92 pages application/pdf Massachusetts Institute of Technology
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	Operations Research Center.
spellingShingle	Operations Research Center. Blanks, Zachary D. A generalized hierarchical approach for data labeling
description	This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. === Thesis: S.M., Massachusetts Institute of Technology, Sloan School of Management, Operations Research Center, 2019 === Cataloged from PDF version of thesis. === Includes bibliographical references (pages 85-90). === The goal of this thesis was to develop a data-type-agnostic classification algorithm suited to problems with a large number of similar labels (e.g., classifying a port versus a shipyard). The most common approach to this issue is to ignore it and fit a single classifier against all targets at once (a "flat" classifier). This technique tends to perform poorly because of label similarity. Existing alternatives, known as hierarchical classifiers (HCs), use clustering heuristics to group the labels. However, the most common HCs require that a "flat" model be trained before the label hierarchy can be learned. The primary issue with this requirement is that if the initial estimator performs poorly, then the resulting HC will have a similar rate of error. === To solve these challenges, we propose three new approaches that learn the label hierarchy without training a model beforehand and one that generalizes the standard HC. The first technique employs a k-means clustering heuristic that groups classes into a specified number of partitions. The second method takes this heuristic and formulates it as a mixed integer program (MIP). Employing a MIP gives the user greater control over the resulting label hierarchy by allowing meaningful constraints to be imposed. The third approach learns meta-classes by using community detection algorithms on graphs, which simplifies the hyper-parameter space when training an HC. Finally, the standard HC methodology is generalized by relaxing the requirement that the original model must be a "flat" classifier; instead, one can provide any of the HC approaches detailed previously as the initializer. === By giving the model a better starting point, the final estimator has a greater chance of yielding a lower error rate. To evaluate the performance of our methods, we tested them on a variety of data sets that contain a large number of similar labels. We observed that the k-means clustering heuristic or the community detection algorithm gave statistically significant improvements in out-of-sample performance over both a flat classifier and a standard hierarchical classifier. Consequently, our approach offers a way to overcome the problems of labeling data with similar classes. === by Zachary D. Blanks. === S.M. Massachusetts Institute of Technology, Sloan School of Management, Operations Research Center
author2	Troy M. Lau and Rahul Mazumder.
author_facet	Troy M. Lau and Rahul Mazumder. Blanks, Zachary D.
author	Blanks, Zachary D.
author_sort	Blanks, Zachary D.
title	A generalized hierarchical approach for data labeling
title_sort	generalized hierarchical approach for data labeling
publisher	Massachusetts Institute of Technology
publishDate	2019
url	https://hdl.handle.net/1721.1/122386
work_keys_str_mv	AT blankszacharyd ageneralizedhierarchicalapproachfordatalabeling AT blankszacharyd generalizedhierarchicalapproachfordatalabeling
_version_	1719261719161733120
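
The abstract says the first technique uses a k-means clustering heuristic to group classes into a specified number of partitions, but this record does not reproduce the algorithm itself. The following is therefore only a minimal sketch, assuming that each class is summarized by the mean of its feature vectors and that plain Lloyd's k-means is run on those means. The function names `class_centroids` and `group_labels`, their parameters, and the toy data are hypothetical and are not taken from the thesis.

```python
import random
from collections import defaultdict


def class_centroids(X, y):
    """Return {label: mean feature vector} for rows X with labels y."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = list(row)
        else:
            sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def group_labels(X, y, k, n_iter=50, seed=0):
    """Cluster class centroids with Lloyd's k-means.

    Assumption (not from the thesis): one mean vector per class.
    Requires k <= number of distinct labels.
    Returns {label: meta_class_index}.
    """
    cents = class_centroids(X, y)
    labels = sorted(cents)
    rng = random.Random(seed)
    # Initialize the k centers from k distinct class centroids.
    centers = [list(cents[lab]) for lab in rng.sample(labels, k)]
    assign = {}
    for _ in range(n_iter):
        new_assign = {
            lab: min(range(k), key=lambda j: sq_dist(cents[lab], centers[j]))
            for lab in labels
        }
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        for j in range(k):
            members = [cents[lab] for lab in labels if assign[lab] == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


# Toy data: two pairs of similar labels, echoing the port/shipyard example.
X = [[0.0, 0.0], [0.2, 0.0], [0.5, 0.0], [0.7, 0.0],
     [10.0, 10.0], [10.2, 10.0], [10.5, 10.0], [10.7, 10.0]]
y = ["port", "port", "shipyard", "shipyard",
     "airport", "airport", "runway", "runway"]
groups = group_labels(X, y, k=2)
```

On this toy data, "port" and "shipyard" end up in one meta-class and "airport" and "runway" in the other. In a hierarchical classifier, a first-stage model would then predict the meta-class and a second-stage model would separate the labels within it; how the thesis builds those stages is not stated in this record.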
