Automatic coding of occupation and cause-of-death records

The Digitising Scotland project aims to digitise 24 million Scottish vital event records of births, marriages and deaths from 1856 to 1973. To use these records effectively for large-scale research they must not only be made machine-readable, but also coded in a form suitable for statistical analysis.

Bibliographic Details
Main Authors: Richard Tobin, Elaine Farrow, Claire Grover, Beatrice Alex
Format: Article
Language: English
Published: Swansea University 2019-11-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/1202
id doaj-3811cf2247e84aa595ff190c0513b413
record_format Article
spelling doaj-3811cf2247e84aa595ff190c0513b413; 2020-11-25T02:14:03Z; eng; Swansea University; International Journal of Population Data Science; ISSN 2399-4908; 2019-11-01; vol. 4, no. 3; doi:10.23889/ijpds.v4i3.1202; Automatic coding of occupation and cause-of-death records; Richard Tobin, Elaine Farrow, Claire Grover, Beatrice Alex (all The University of Edinburgh); https://ijpds.org/article/view/1202
collection DOAJ
language English
format Article
sources DOAJ
author Richard Tobin
Elaine Farrow
Claire Grover
Beatrice Alex
title Automatic coding of occupation and cause-of-death records
publisher Swansea University
series International Journal of Population Data Science
issn 2399-4908
publishDate 2019-11-01
description The Digitising Scotland project aims to digitise 24 million Scottish vital event records of births, marriages and deaths from 1856 to 1973. To use these records effectively for large-scale research they must not only be made machine-readable, but also coded in a form suitable for statistical analysis. The digitised birth, marriage, and death certificates include textual descriptions of occupations and causes of death. Our aim is to map these descriptions to standard HISCO and ICD-10 codes. It is impractical to have experts code all the records manually, so we treat the problem as a text classification task and apply machine learning techniques. A proportion of the records will be manually coded and used to train the system. More recent records are already coded and these can also be used for training. Following earlier work by [Kirby et al] and [Carson et al] we are experimenting with Bayesian classifiers for this task. By combining exact matching for texts that have been seen in the training data and Bayes for the rest, we get an accuracy in cross-validation of 92% for causes of death and 94-97% for occupations. We are investigating methods to improve this, including automatic spelling correction and synonym detection, use of age and sex information, and (for causes of death) the presence of co-occurring causes. We are also investigating the value of coarser-grained but more reliable coding, and reporting second- and third-choice codes. This is work in progress, and the final paper will consider whether the improvements we are making are sufficient to produce useful data for further research. We will also make recommendations about further manual annotation to provide training data covering the whole timespan of the records.
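The combination described in the abstract (exact matching for texts already seen in training, a Bayesian classifier for the rest, and reporting of second- and third-choice codes) can be illustrated with a minimal sketch. This is not the authors' system: the class name `HybridCoder`, the whitespace tokenisation, the add-one smoothing constant, and the placeholder labels `CODE_MINER` and `CODE_FARM` (which are not real HISCO or ICD-10 codes) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class HybridCoder:
    """Illustrative only: exact-match lookup with a naive Bayes fallback."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                         # smoothing constant (assumed value)
        self.exact = {}                            # normalised text -> Counter of codes
        self.code_counts = Counter()               # code -> number of training records
        self.token_counts = defaultdict(Counter)   # code -> token frequencies
        self.vocab = set()

    @staticmethod
    def normalise(text):
        # Lower-case and collapse whitespace; real records would need more cleaning.
        return " ".join(text.lower().split())

    def fit(self, texts, codes):
        for text, code in zip(texts, codes):
            key = self.normalise(text)
            self.exact.setdefault(key, Counter())[code] += 1
            self.code_counts[code] += 1
            tokens = key.split()
            self.token_counts[code].update(tokens)
            self.vocab.update(tokens)
        return self

    def _bayes_ranking(self, tokens):
        # Multinomial naive Bayes scores in log space, best code first.
        total = sum(self.code_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for code, n in self.code_counts.items():
            counts = self.token_counts[code]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n / total)
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            scores[code] = score
        return sorted(scores, key=scores.get, reverse=True)

    def predict(self, text, k=3):
        """Return (up to k candidate codes, name of the method used)."""
        key = self.normalise(text)
        if key in self.exact:
            # Text seen in training: return its most frequent code.
            return [self.exact[key].most_common(1)[0][0]], "exact"
        # Unseen text: fall back to Bayes and report the top k choices.
        return self._bayes_ranking(key.split())[:k], "bayes"


# Toy training data with placeholder labels (not real HISCO codes).
coder = HybridCoder().fit(
    ["coal miner", "iron miner", "farm servant", "farm labourer"],
    ["CODE_MINER", "CODE_MINER", "CODE_FARM", "CODE_FARM"],
)
seen_codes, seen_method = coder.predict("Coal  Miner")    # seen after normalisation
unseen_codes, unseen_method = coder.predict("lead miner") # not seen in training
```

In this sketch the fallback returns a ranked list, so the second and third entries play the role of the second- and third-choice codes mentioned in the abstract. Measuring accuracy would still require held-out data, for example by cross-validation as the authors report.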
url https://ijpds.org/article/view/1202