Using Logistic Regression to Estimate the False Positive Rate in The IDI (SoLinks)

Introduction: Stats NZ’s Integrated Data Infrastructure (IDI) is a linked longitudinal database combining administrative and survey data. Previously, false positive linkages (FP) in the IDI were assessed by clerical review of a sample of linked records, which was time-consuming and subject to incons...


Bibliographic Details
Main Authors: Anna Lin, Soon Song, Nancy Wang
Format: Article
Language: English
Published: Swansea University 2020-12-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/1484
spelling doaj-b89eeb911bb14e538c770a779160f837; 2021-02-10T16:42:59Z; eng; Swansea University; International Journal of Population Data Science; ISSN 2399-4908; 2020-12-01; volume 5, issue 5; DOI 10.23889/ijpds.v5i5.1484; Using Logistic Regression to Estimate the False Positive Rate in The IDI (SoLinks)
Anna Lin (Statistics New Zealand); Soon Song (formerly worked at Statistics New Zealand); Nancy Wang (formerly worked at Statistics New Zealand)
https://ijpds.org/article/view/1484
collection DOAJ
language English
format Article
sources DOAJ
author Anna Lin
Soon Song
Nancy Wang
publisher Swansea University
series International Journal of Population Data Science
issn 2399-4908
publishDate 2020-12-01
description
Introduction: Stats NZ’s Integrated Data Infrastructure (IDI) is a linked longitudinal database combining administrative and survey data. Previously, false positive linkages (FP) in the IDI were assessed by clerical review of a sample of linked records, which was time-consuming and subject to inconsistency.
Objectives and Approach: A modelled approach, ‘SoLinks’, has been developed to automate the FP estimation process for the IDI. It uses a logistic regression model to calculate the probability that a given link is a true match. The model is based on the agreement types defined for four key linking variables: first name, last name, sex, and date of birth. Exemptions have been given to some specific types of links that we believe to be high-quality true matches. The training data used to estimate the model parameters were based on the outcomes of the clerical review process over several years.
Results: We have compared the FP rates estimated through clerical review with those estimated through the SoLinks model. Some SoLinks estimates fall outside the 95% confidence intervals of the clerically reviewed ones. This may be because the pre-defined probabilities for the specific types of links are too high.
Conclusion: The automation of FP checking has saved analyst time and resources. The modelled FP estimates have been more stable across time than the previous clerical reviews. As this model estimates the probability of a true match at the individual link level, we may provide this probability to researchers so that they can calculate linkage quality indicators for their research populations.
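The approach described in the abstract can be illustrated with a minimal Python sketch. It is not the SoLinks implementation, and everything in it is an assumption made for illustration: the agreement types are simplified to binary agree/disagree flags for the four linking variables, the "clerical review" training sample is synthetic, the model is fitted with a plain gradient-descent loop, and the function names are hypothetical. The FP rate is taken here as the average of one minus the predicted probability of a true match across links.

```python
import math
import random

# Hypothetical binary agreement flags (1 = agree, 0 = disagree) for the four
# linking variables named in the abstract. The real model uses richer
# "agreement types"; binary flags are a simplifying assumption.
FEATURES = ("first_name", "last_name", "sex", "date_of_birth")


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_true_match(weights, bias, x):
    """Modelled probability that a link with agreement vector x is a true match."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(rows, labels, lr=0.5, epochs=300):
    """Fit a logistic regression by batch gradient descent on the log-loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_true_match(weights, bias, x) - y
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def make_synthetic_review_sample(n, seed=0):
    """Synthetic stand-in for clerically reviewed links (not real IDI data)."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        is_true = 1 if rng.random() < 0.9 else 0
        # Assumed agreement rates: true matches agree far more often.
        agree_prob = 0.95 if is_true else 0.4
        rows.append([1 if rng.random() < agree_prob else 0 for _ in FEATURES])
        labels.append(is_true)
    return rows, labels


def estimated_fp_rate(weights, bias, links):
    """Average modelled probability that a link is NOT a true match."""
    probs = [predict_true_match(weights, bias, x) for x in links]
    return sum(1.0 - p for p in probs) / len(probs)


rows, labels = make_synthetic_review_sample(1000)
weights, bias = fit_logistic(rows, labels)
fp_rate = estimated_fp_rate(weights, bias, rows)
p_all_agree = predict_true_match(weights, bias, [1, 1, 1, 1])
p_none_agree = predict_true_match(weights, bias, [0, 0, 0, 0])
print(f"estimated FP rate: {fp_rate:.3f}")
print(f"P(true match | all agree): {p_all_agree:.3f}")
print(f"P(true match | none agree): {p_none_agree:.3f}")
```

Because the model scores each link individually, the same per-link probabilities could be averaged over any subset of links, which is the idea behind giving researchers quality indicators for their own research populations. The exemptions mentioned in the abstract, where certain link types receive pre-defined probabilities, are not modelled in this sketch.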
url https://ijpds.org/article/view/1484