Data Imputation Meaning
Data imputation is the process of filling in missing data in a dataset with substitute values. Many datasets are incomplete, with missing data points and values, and imputation helps retain as much of the dataset as possible.
There are multiple data imputation techniques. All of them rest on the idea that it is neither feasible nor practical to remove every record with missing data, because doing so shrinks the dataset and can raise concerns about data quality, introduce bias and impair analysis.
Key Takeaways
- Data imputation is the technique of retaining most of a dataset’s information by substituting estimated values for missing data.
- Statisticians use it to keep the dataset from shrinking substantially and to avoid biased or impaired analysis.
- There are two types of missing data imputation: single and multiple data imputation. The former offers a single value; the latter provides a set of responses.
- Data imputation follows certain rules, and one cannot apply it without considering certain factors and aspects of the data.
Data Imputation Explained
Data imputation is the technique of substituting values in a dataset in places where the data is missing. In most cases, collected data comes with imperfections such as a degree of error, a distortion level, variance and missing values. When the data is incomplete and lacks data points, it is challenging for an analyst to continue their calculations because there is a higher probability of inaccuracy in the outcome.
To prevent this, missing data imputation estimates a reasonable value to replace each missing value by analyzing the rest of the data, thereby completing the dataset. The data then becomes complete and ready to fit into a model for further analysis. This matters because it prevents the dataset from shrinking and keeps the analysis from being impaired.
There are two basic data imputation methods: single and multiple. The former is less complex and offers a specific number in place of the missing data, whereas the latter uses simulation models to offer a set of possible responses. In practice, R and Python, along with other software tools such as SAS, Stata and SPSS, can perform data imputation and also help in analyzing the pooled imputed datasets.
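As a rough illustration of the difference between the two methods, the sketch below fills the same gaps once with the observed mean (single imputation) and then several times with random draws from the observed values, pooling the per-dataset means at the end. The random draws are only a very simplified stand-in for the simulation models used in real multiple imputation. The data, the function names, the number of imputations `m` and the seed are all illustrative assumptions.

```python
import random
from statistics import mean

def single_impute(values):
    """Single imputation: every gap gets one value (here, the observed mean)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def multiple_impute(values, m=5, seed=0):
    """Simplified multiple imputation: build m completed copies of the data,
    filling each gap with a random draw from the observed values."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    observed = [v for v in values if v is not None]
    return [
        [rng.choice(observed) if v is None else v for v in values]
        for _ in range(m)
    ]

def pooled_mean(datasets):
    """Pool the analysis result (here, the mean) across the completed datasets."""
    return mean(mean(d) for d in datasets)

data = [12, None, 15, 11, None, 14]      # hypothetical measurements
single = single_impute(data)             # one completed dataset
completed = multiple_impute(data, m=5)   # five completed datasets

print(single)                  # [12, 13, 15, 11, 13, 14]
print(len(completed))          # 5
print(pooled_mean(completed))  # depends on the random draws
```

The single result is one definite series, while the multiple result is a set of plausible series whose spread reflects the uncertainty about the missing values.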
There are several data imputation methods and techniques, but each one of them depends on the calculation and analysis model used:
- Next or previous value – This is one of the most common imputation techniques, in which one substitutes the previous or next value for the missing value in a time series or other ordered data.
- K nearest neighbors – In this method, one finds the k examples in the data that are nearest to the incomplete record and have a value present for the relevant feature. The missing value is then replaced with the value that occurs most frequently in that group.
- Minimum or maximum value – If the researcher is aware of the specific range that the dataset must fit in, they can use either the maximum or minimum value of the data to impute the missing values.
- Missing value prediction – As a form of single imputation, one can use a machine learning model to predict the imputation value, training the model on the values present in the other columns.
- Most frequent value – In this method, one assesses the whole dataset and uses the most frequent value to impute missing values.
- Average or linear interpolation – This method is similar to the previous or next-value technique, but it fills the gap with a value calculated from the observed values on either side of it. It applies only to numerical data, which must be sorted in advance.
- Mean, median or moving average value – In statistics, the median, mean or rounded mean are popular imputation values because they indirectly represent the data. However, the median is preferred over the mean when a dataset has multiple outliers.
- Fixed value – It is a universally popular method that replaces the null data with a fixed value and is applicable to all data types.
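A few of the techniques in the list above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names and sample data are hypothetical, missing values are represented as `None`, and the nearest-neighbors sketch averages the neighbors’ values (a common variant for numeric data) instead of taking the most frequent value described in the list.

```python
from statistics import mean

def forward_fill(values):
    """Previous-value imputation: carry the last observed value forward."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None if the series starts with a gap
    return filled

def linear_interpolate(values):
    """Fill interior gaps on a straight line between the observed neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled  # leading or trailing gaps are left as None

def knn_impute(rows, target, k=2):
    """Nearest-neighbours imputation for one numeric column.

    For each row missing `target`, find the k complete rows closest in the
    other columns (squared Euclidean distance) and use their mean target.
    """
    complete = [r for r in rows if r[target] is not None]
    result = []
    for row in rows:
        if row[target] is not None:
            result.append(list(row))
            continue

        def distance(other):
            # Compare every column except the one being imputed.
            return sum((a - b) ** 2
                       for i, (a, b) in enumerate(zip(row, other))
                       if i != target)

        nearest = sorted(complete, key=distance)[:k]
        new_row = list(row)
        new_row[target] = mean(r[target] for r in nearest)
        result.append(new_row)
    return result

series = [10, None, None, 16, None, 20]
print(forward_fill(series))        # [10, 10, 10, 16, 16, 20]
print(linear_interpolate(series))  # [10, 12.0, 14.0, 16, 18.0, 20]

rows = [(1.0, 10), (2.0, 20), (3.0, None), (4.0, 40)]  # (feature, target)
print(knn_impute(rows, target=1, k=2)[2])  # [3.0, 30]
```

Note that the two series-based functions rely on the order of the data, which is why these techniques suit time series or data sorted in advance.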
Below are two examples of data imputation –
Suppose there is a class of 18 students whose heights are recorded in the following series –
16, _, 20, 18, 15, 17, 14, 12, _, 18, 19, 20, 16, 12, 18, _, 18, 14
In this series, three values are missing. The researcher can choose among different imputation methods; if they follow the most-frequent-value method, the fill value is 18, which appears four times, more than any other height.
The researcher then imputes the missing values with the most frequent value, making the series –
16, 18, 20, 18, 15, 17, 14, 12, 18, 18, 19, 20, 16, 12, 18, 18, 18, 14
Again, this is a simple data imputation example. In real-world scenarios, the researcher may take another approach and consider other analysis models and factors.
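The height example can be reproduced with a short sketch like the one below, which uses only the Python standard library. Representing the missing heights as `None` and the variable names are assumptions made for illustration.

```python
from statistics import mode

# The 18 heights from the example; None marks the three missing values.
heights = [16, None, 20, 18, 15, 17, 14, 12, None,
           18, 19, 20, 16, 12, 18, None, 18, 14]

observed = [h for h in heights if h is not None]
most_frequent = mode(observed)  # 18 appears four times among observed values
imputed = [most_frequent if h is None else h for h in heights]

print(most_frequent)  # 18
print(imputed)
```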
In another example, suppose a researcher comes across time series data with nine data points, three of which are missing; the series is as follows –
2, 4, _, 6, 7, _, 8, 9, _
Now, the easiest way for the researcher to impute is to replace each missing value with the mean of the observed values for that variable –
2 + 4 + 6 + 7 + 8 + 9 = 36
36/6 = 6
The researcher would then replace each missing value with 6, making the series –
2, 4, 6, 6, 7, 6, 8, 9, 6
This is the mean variant of the mean, median or moving average technique. Both examples are very straightforward, but in reality, datasets are quite large, and the analysis models are complex.
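The mean calculation in this example can be checked with a few lines of Python. The sketch below also prints the median of the observed values, since the median is the usual alternative when outliers are present; the variable names and the use of `None` for gaps are illustrative assumptions.

```python
from statistics import mean, median

series = [2, 4, None, 6, 7, None, 8, 9, None]
observed = [x for x in series if x is not None]

fill = mean(observed)  # (2 + 4 + 6 + 7 + 8 + 9) / 6 = 6
imputed = [fill if x is None else x for x in series]

print(imputed)           # [2, 4, 6, 6, 7, 6, 8, 9, 6]
print(median(observed))  # 6.5
```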
The applications of data imputation are –
- One can apply data imputation to complete a time series dataset.
- It finds use in finance to record price movement when a price cap has been reached; one can then substitute the missing prices with the asset’s minimum value.
- It is employed to keep the data intact, without missing values that cause bias.
- In any form of study, missing values in the dataset distort the model, the analysis and the outcomes; the core application of data imputation is to prevent this.
- One can apply it with different analysis models using Python, R, machine learning and SPSS to carry out complex statistical analysis that cannot otherwise be calculated using traditional methods.
Advantages And Disadvantages
The advantages of data imputation are –
- Replaces the missing values in a dataset with relevant values through estimation and substitution techniques.
- Prevents data and analysis from losing valuable information.
- Improves the performance and efficiency of machine learning models.
- Minimizes data bias and helps keep the analysis from being impaired.
- Helps in data preservation and does not introduce external variability.
The disadvantages of data imputation are –
- The more values are missing, the more distortion will be present in the imputed dataset.
- Data imputation, if not processed correctly, can disturb the original variable distribution.
- Depending on the model, over-representation of a particular category can be observed.
- Some imputation techniques are computationally complex and expensive to run.
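The point about disturbing the original distribution can be seen in a small sketch: filling several gaps with the mean leaves the average unchanged but pulls the variance down, because every imputed point sits exactly at the center. The data and function name below are hypothetical and chosen only to make the effect easy to see.

```python
from statistics import mean, pvariance

def mean_impute(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

data = [10, None, 20, None, 30, None, 40]  # hypothetical, 3 of 7 missing
observed = [v for v in data if v is not None]
imputed = mean_impute(data)

print(mean(observed), mean(imputed))  # 25 25
print(pvariance(observed))            # 125
print(pvariance(imputed))             # about 71.4, noticeably smaller
```

The larger the share of imputed values, the stronger this shrinkage becomes, which is one reason heavily incomplete variables are treated with caution.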
Frequently Asked Questions (FAQs)
The general rules of data imputation are –
– If the data is missing completely at random (MCAR), it does not require imputation.
– We must not apply imputation to all the data for a subject.
– The process should not be initiated if more than 50% of the data is missing; some researchers may use a cutoff of 20%.
– If the data is missing not at random (MNAR), imputation shall not be performed.
– If the data generates values outside the valid ranges, then imputation can be applied.
Data imputation and data removal are opposite approaches: the former replaces missing values with reasonable substitutes and estimates, while the latter simply removes values from the dataset. Additionally, imputation is performed so that the dataset does not get decimated, whereas data removal shrinks it.
From an imputation perspective, missing data can generally be ignored if less than 5% of the data is missing. When the share goes beyond 10%, the dataset is more likely to introduce bias and requires handling. In such cases, data imputation is employed to complete the data.
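A simple check of the missing share can make these thresholds concrete. The sketch below applies the 5% and 10% figures mentioned above to the student height series from the earlier example; the function names and the label used for the range between the two thresholds are assumptions, since the text does not say how that range should be treated.

```python
def missing_fraction(values):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def assess_missingness(values, ignorable=0.05, needs_handling=0.10):
    """Classify a variable by its missing share.

    The 5% and 10% thresholds come from the text; the "judgment call"
    label for shares in between is an assumption of this sketch.
    """
    share = missing_fraction(values)
    if share < ignorable:
        return "ignorable"
    if share > needs_handling:
        return "needs handling"
    return "judgment call"

heights = [16, None, 20, 18, 15, 17, 14, 12, None,
           18, 19, 20, 16, 12, 18, None, 18, 14]

print(round(missing_fraction(heights), 3))  # 0.167 (3 of 18 missing)
print(assess_missingness(heights))          # needs handling
```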
This article has been a guide to Data Imputation and its meaning. Here, we explain its techniques, applications, examples, advantages, and disadvantages.