Introduction to Missing Data

Aaron Yang
3 min read · Jan 29, 2023

There are three main mechanisms for missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

  1. Missing completely at random (MCAR) means that the probability of a value being missing is unrelated to any variable in the study, observed or unobserved. For example, if some respondents skip a survey question purely by chance, the missing data are MCAR. Formally, P(Y is missing | X, Y) = P(Y is missing).
  2. Missing at random (MAR) means that the probability of a value being missing may depend on other, observed variables in the study but, once those are accounted for, not on the value of the variable of interest itself. For example, suppose unemployed respondents are more likely to skip a question about income, employment status is recorded for everyone, and among people with the same employment status skipping does not depend on income. The missing data are then MAR. Formally, P(Y is missing | X, Y) = P(Y is missing | X).
  3. Missing not at random (MNAR) means that the probability of a value being missing depends on the unobserved value itself, even after accounting for the observed variables. For example, if a survey question about a sensitive topic is skipped mostly by people who hold a certain opinion on that topic, the missing data are MNAR. Formally, P(Y is missing | X, Y) still depends on Y and does not reduce to P(Y is missing | X).
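To make the three definitions concrete, here is a minimal simulation sketch using only the Python standard library. Everything in it is an assumption chosen for illustration: the data-generating model (Y = X + noise), the 30% MCAR rate, the logistic missingness probabilities, and the helper name `simulate`. It compares the true mean of Y with the mean of the values that remain observed.

```python
import math
import random
import statistics


def simulate(mechanism, n=20000, seed=0):
    """Return (true_mean, complete_case_mean) of Y under one missingness mechanism."""
    rng = random.Random(seed)
    full, observed = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)        # fully observed covariate X
        y = x + rng.gauss(0, 1)    # variable of interest Y, correlated with X
        if mechanism == "MCAR":
            p_missing = 0.3                      # constant: ignores X and Y
        elif mechanism == "MAR":
            p_missing = 1 / (1 + math.exp(-x))   # depends on observed X only
        elif mechanism == "MNAR":
            p_missing = 1 / (1 + math.exp(-y))   # depends on Y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        full.append(y)
        if rng.random() >= p_missing:            # Y stays observed
            observed.append(y)
    return statistics.fmean(full), statistics.fmean(observed)


results = {m: simulate(m) for m in ("MCAR", "MAR", "MNAR")}
for name, (true_mean, obs_mean) in results.items():
    print(f"{name}: true mean = {true_mean:+.3f}, complete-case mean = {obs_mean:+.3f}")
```

Under MCAR the complete-case mean should stay close to the true mean. Under MAR and MNAR it drifts downward in this setup, because larger values are more likely to be missing. The difference is that under MAR the drift can in principle be corrected using the observed X, whereas under MNAR it cannot be corrected from the observed data alone.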

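In general, the missing data mechanism is difficult to determine. The observed data can provide evidence against MCAR, but they cannot distinguish MAR from MNAR, because that distinction depends on values that were never observed. A sensitivity analysis, which repeats the analysis under different plausible assumptions about the mechanism, shows how much the conclusions depend on that choice. It is important to carefully consider the potential mechanisms for missing data when designing a study and analyzing the results.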

Last Observation Carried Forward method for missing data imputation

The Last Observation Carried Forward (LOCF) method is a simple approach to dealing with missing data in a time series analysis. The idea behind LOCF is to use the last known value for a given variable to replace any missing values that occur after that value. For example, if we have a series of measurements of a person’s weight taken at regular intervals, and one of the measurements is missing, we can use the last known weight measurement to impute the missing value.

The main limitation of the LOCF method is that it assumes that the missing value is not significantly different from the last known value. This may not always be the case, particularly if there is a trend or other patterns in the data that are not captured by the last known value. In such cases, using the last known value to impute the missing data can lead to biased or inaccurate results. Additionally, the LOCF method does not take into account any other information that may be available, such as the person’s age or activity level, which could affect their weight.
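As a small illustration of LOCF and of the trend problem described above, the sketch below carries the last observed value forward through a list in which `None` marks a missing measurement. The helper name `locf` and the weight values are hypothetical.

```python
def locf(values):
    """Replace each None with the most recent non-missing value.

    Leading Nones stay None, since there is no earlier observation to carry forward.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Hypothetical weekly weights (kg) with a steady downward trend;
# the fourth and fifth measurements are missing.
weights = [80.0, 79.0, 78.0, None, None, 75.0]
print(locf(weights))  # [80.0, 79.0, 78.0, 78.0, 78.0, 75.0]
```

If the downward trend had continued through the gap, the missing weights would have been closer to 77 and 76, so LOCF overstates them here. For data held in pandas, `Series.ffill()` performs the same forward fill.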

Single Imputation

Single imputation is a statistical technique used to replace missing data values with a single estimate. The aim is to fill in the missing values with a representative value, based on the available information. Common methods for single imputation include mean imputation, median imputation, and most frequent value imputation (mode).

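Single imputation can introduce bias into the analysis. It also treats the imputed values as if they had actually been observed, so it does not reflect the uncertainty about the missing values. Mean imputation, for example, shrinks the variance of the variable and weakens its correlations with other variables, and standard errors computed from the completed dataset are too small.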
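The effect of mean imputation on spread can be seen in a short sketch. The simulated values (normal with mean 50 and standard deviation 10) and the 30% MCAR deletion rate are assumptions made purely for illustration.

```python
import random
import statistics

rng = random.Random(1)

# Simulated complete data, then roughly 30% of values deleted completely at random.
complete = [rng.gauss(50, 10) for _ in range(1000)]
observed = [v if rng.random() >= 0.3 else None for v in complete]

# Mean imputation: every missing value is replaced by the mean of the observed values.
present = [v for v in observed if v is not None]
fill = statistics.fmean(present)
imputed = [fill if v is None else v for v in observed]

sd_observed = statistics.stdev(present)
sd_imputed = statistics.stdev(imputed)
print(f"SD of observed values:       {sd_observed:.2f}")
print(f"SD after mean imputation:    {sd_imputed:.2f}")
```

The mean is unchanged by construction, but the standard deviation drops, because every imputed value sits exactly at the center of the distribution. Any analysis that uses the completed data as if it were fully observed inherits that artificially small spread.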

Multiple Imputation

Multiple imputation accounts for the uncertainty in the imputation by combining results across several imputed datasets. Instead of filling in each missing value once, as in single imputation, the missing values are imputed m times to create m complete datasets. The analysis is then run on each of the m datasets, giving m sets of results. Finally, the m results are pooled into one: the pooled estimate is the mean of the m parameter estimates, and its variance combines the average within-imputation variance with the between-imputation variance (Rubin's rules). Confidence intervals for the parameters of interest follow from the pooled estimate and its variance.
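The pooling step can be sketched with Rubin's rules, the standard way of combining the m analyses. The function name `pool_rubin` and the example numbers are hypothetical, and the sketch covers only pooling: it assumes the m estimates and their variances have already been obtained from the m imputed datasets.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances using Rubin's rules.

    Returns (pooled_estimate, total_variance, degrees_of_freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates, each with a variance")
    pooled = statistics.fmean(estimates)          # mean of the m estimates
    within = statistics.fmean(variances)          # average within-imputation variance
    between = statistics.variance(estimates)      # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between        # total variance of the pooled estimate
    if between == 0:
        df = math.inf                             # no between-imputation variability
    else:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return pooled, total, df


# Hypothetical results from m = 3 imputed datasets.
estimate, total_var, df = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(f"pooled estimate = {estimate:.3f}, standard error = {math.sqrt(total_var):.3f}, df = {df:.2f}")
```

The between-imputation term is what single imputation leaves out: it grows when the imputed datasets disagree with one another, which widens the resulting confidence interval. The interval itself would use a t quantile with the returned degrees of freedom, which the standard library does not provide.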

What is the advantage of multiple imputation versus single imputation methods?

Multiple imputation creates several versions of the dataset, each with the missing values filled in by different plausible values drawn from an imputation model. Its advantage over single imputation is that it gives more reliable estimates and standard errors for the quantities being analyzed. With single imputation, only one value is produced for each missing entry, and that value may not be representative of the true underlying value, yet it is analyzed as if it were known. With multiple imputation, the variation between the imputed datasets measures how uncertain the imputations are, and pooling the results carries that uncertainty into the final standard errors and confidence intervals. This makes the inferences and decisions based on the data more trustworthy.

Written by Aaron Yang

I am a second-year biostatistics PhD candidate and work as a GRA at MD Anderson.