A nonparametric multiple imputation approach for missing categorical data

Muhan Zhou, Yulei He, Mandi Yu, Chiu Hsieh Hsu

Research output: Research - peer-reviewArticle

Abstract

Background: Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness) probabilities. Methods: We propose a nearest-neighbour multiple imputation approach to impute a missing at random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring distances between each missing value with other non-missing values. The distance function is calculated based on a predictive score, which is derived from two working models: one fits a multinomial logistic regression for predicting the missing categorical outcome (the outcome model) and the other fits a logistic regression for predicting missingness probabilities (the missingness model). A weighting scheme is used to accommodate contributions from two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented. Results: The simulation study suggests that the proposed method performs well when missingness probabilities are not extreme under some misspecifications of the working models. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method. Conclusions: We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with more than two levels for assessing the distribution of the outcome. In terms of the choices for the working models, we suggest a multinomial logistic regression for predicting the missing outcome and a binary logistic regression for predicting the missingness probability.

LanguageEnglish (US)
Article number87
JournalBMC Medical Research Methodology
Volume17
Issue number1
DOIs
StatePublished - Jun 6 2017

Fingerprint

Logistic Models
Calibration
Public Health
Weights and Measures

Keywords

  • Categorical data
  • Double robustness
  • Missing at Random
  • Multiple imputation
  • Nearest neighbour

ASJC Scopus subject areas

  • Epidemiology
  • Health Informatics

Cite this

A nonparametric multiple imputation approach for missing categorical data. / Zhou, Muhan; He, Yulei; Yu, Mandi; Hsu, Chiu Hsieh.

In: BMC Medical Research Methodology, Vol. 17, No. 1, 87, 06.06.2017.

Research output: Research - peer-reviewArticle

@article{f5e4bd7388214bfe94fb967194cdf2d3,
title = "A nonparametric multiple imputation approach for missing categorical data",
abstract = "Background: Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness) probabilities. Methods: We propose a nearest-neighbour multiple imputation approach to impute a missing at random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring distances between each missing value with other non-missing values. The distance function is calculated based on a predictive score, which is derived from two working models: one fits a multinomial logistic regression for predicting the missing categorical outcome (the outcome model) and the other fits a logistic regression for predicting missingness probabilities (the missingness model). A weighting scheme is used to accommodate contributions from two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented. Results: The simulation study suggests that the proposed method performs well when missingness probabilities are not extreme under some misspecifications of the working models. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method. Conclusions: We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with more than two levels for assessing the distribution of the outcome. In terms of the choices for the working models, we suggest a multinomial logistic regression for predicting the missing outcome and a binary logistic regression for predicting the missingness probability.",
keywords = "Categorical data, Double robustness, Missing at Random, Multiple imputation, Nearest neighbour",
author = "Muhan Zhou and Yulei He and Mandi Yu and Hsu, {Chiu Hsieh}",
year = "2017",
month = "6",
doi = "10.1186/s12874-017-0360-2",
volume = "17",
journal = "BMC Medical Research Methodology",
issn = "1471-2288",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - A nonparametric multiple imputation approach for missing categorical data

AU - Zhou,Muhan

AU - He,Yulei

AU - Yu,Mandi

AU - Hsu,Chiu Hsieh

PY - 2017/6/6

Y1 - 2017/6/6

N2 - Background: Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness) probabilities. Methods: We propose a nearest-neighbour multiple imputation approach to impute a missing at random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring distances between each missing value with other non-missing values. The distance function is calculated based on a predictive score, which is derived from two working models: one fits a multinomial logistic regression for predicting the missing categorical outcome (the outcome model) and the other fits a logistic regression for predicting missingness probabilities (the missingness model). A weighting scheme is used to accommodate contributions from two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented. Results: The simulation study suggests that the proposed method performs well when missingness probabilities are not extreme under some misspecifications of the working models. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method. Conclusions: We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with more than two levels for assessing the distribution of the outcome. In terms of the choices for the working models, we suggest a multinomial logistic regression for predicting the missing outcome and a binary logistic regression for predicting the missingness probability.

AB - Background: Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness) probabilities. Methods: We propose a nearest-neighbour multiple imputation approach to impute a missing at random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring distances between each missing value with other non-missing values. The distance function is calculated based on a predictive score, which is derived from two working models: one fits a multinomial logistic regression for predicting the missing categorical outcome (the outcome model) and the other fits a logistic regression for predicting missingness probabilities (the missingness model). A weighting scheme is used to accommodate contributions from two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented. Results: The simulation study suggests that the proposed method performs well when missingness probabilities are not extreme under some misspecifications of the working models. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method. Conclusions: We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with more than two levels for assessing the distribution of the outcome. In terms of the choices for the working models, we suggest a multinomial logistic regression for predicting the missing outcome and a binary logistic regression for predicting the missingness probability.

KW - Categorical data

KW - Double robustness

KW - Missing at Random

KW - Multiple imputation

KW - Nearest neighbour

UR - http://www.scopus.com/inward/record.url?scp=85020399741&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85020399741&partnerID=8YFLogxK

U2 - 10.1186/s12874-017-0360-2

DO - 10.1186/s12874-017-0360-2

M3 - Article

VL - 17

JO - BMC Medical Research Methodology

T2 - BMC Medical Research Methodology

JF - BMC Medical Research Methodology

SN - 1471-2288

IS - 1

M1 - 87

ER -