Race and ethnicity data for first, middle, and surnames

We provide the largest compiled publicly available dictionaries of first, middle, and surnames for the purpose of imputing race and ethnicity using, for example, Bayesian Improved Surname Geocoding (BISG). The dictionaries are based on the voter files of six U.S. Southern States that collect self-re...


Bibliographic Details
Main Authors: Imai, K. (Author), Olivella, S. (Author), Rosenman, E.T.R. (Author)
Format: Article
Language: English
Published: NLM (Medline) 2023
Subjects: adult, article, ethnic group, ethnicity, Hispanic, human, probability, race
Online Access: View Fulltext in Publisher
View in Scopus
LEADER 01974nam a2200265Ia 4500
001 10.1038-s41597-023-02202-2
008 230529s2023 CNT 000 0 und d
020 |a 20524463 (ISSN) 
245 1 0 |a Race and ethnicity data for first, middle, and surnames 
260 0 |b NLM (Medline)  |c 2023 
856 |z View Fulltext in Publisher  |u https://doi.org/10.1038/s41597-023-02202-2 
856 |z View in Scopus  |u https://www.scopus.com/inward/record.uri?eid=2-s2.0-85159706181&doi=10.1038%2fs41597-023-02202-2&partnerID=40&md5=885a2b2f2b1abbd4ba0c7362d759c756 
520 3 |a We provide the largest compiled publicly available dictionaries of first, middle, and surnames for the purpose of imputing race and ethnicity using, for example, Bayesian Improved Surname Geocoding (BISG). The dictionaries are based on the voter files of six U.S. Southern States that collect self-reported racial data upon voter registration. Our data cover the racial make-up of a larger set of names than any comparable dataset, containing 136 thousand first names, 125 thousand middle names, and 338 thousand surnames. Individuals are categorized into five mutually exclusive racial and ethnic groups - White, Black, Hispanic, Asian, and Other - and racial/ethnic probabilities by name are provided for every name in each dictionary. We provide both probabilities of the form ℙ(race|name) and ℙ(name|race), and conditions under which they can be assumed to be representative of a given target population. These conditional probabilities can then be deployed for imputation in a data analytic task for which self-reported racial and ethnic data is not available. © 2023. The Author(s). 
650 0 4 |a adult 
650 0 4 |a article 
650 0 4 |a ethnic group 
650 0 4 |a ethnicity 
650 0 4 |a Hispanic 
650 0 4 |a human 
650 0 4 |a probability 
650 0 4 |a race 
700 1 0 |a Imai, K.  |e author 
700 1 0 |a Olivella, S.  |e author 
700 1 0 |a Rosenman, E.T.R.  |e author 
773 |t Scientific data
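
The abstract describes probabilities of the form ℙ(race|name) and ℙ(name|race) that can be used for imputation with methods such as BISG. The following is a minimal sketch of how such a combination might look, not the authors' implementation. It assumes that surname, geography, and (optionally) first name are conditionally independent given race, which is the usual BISG-style assumption. The function name `bisg_posterior`, the lowercase race labels, and every number in the example are hypothetical and are not taken from the dictionaries.

```python
# Illustrative sketch only: names and all probabilities below are hypothetical.
RACES = ("white", "black", "hispanic", "asian", "other")


def bisg_posterior(p_race_given_surname, p_geo_given_race, p_first_given_race=None):
    """Combine P(race | surname) with P(geo | race) via Bayes' rule.

    Assumes surname, geography, and first name are conditionally
    independent given race. If p_first_given_race is supplied, it is
    multiplied in as an additional likelihood term.
    """
    unnormalized = {}
    for race in RACES:
        weight = p_race_given_surname[race] * p_geo_given_race[race]
        if p_first_given_race is not None:
            weight *= p_first_given_race[race]
        unnormalized[race] = weight

    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("all race categories have zero probability")

    # Normalize so the five categories sum to one.
    return {race: weight / total for race, weight in unnormalized.items()}


# Made-up inputs for a single individual.
p_race_given_surname = {
    "white": 0.05, "black": 0.01, "hispanic": 0.90, "asian": 0.02, "other": 0.02,
}
p_geo_given_race = {
    "white": 0.0001, "black": 0.0002, "hispanic": 0.0004, "asian": 0.0001, "other": 0.0001,
}

posterior = bisg_posterior(p_race_given_surname, p_geo_given_race)
for race in RACES:
    print(f"{race}: {posterior[race]:.4f}")
```

Passing a third dictionary of ℙ(first name|race) values adds the first-name term in the same way. Whether the conditional-independence assumption and the dictionary probabilities are appropriate for a given target population is a question the article itself addresses.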