[D] Comparison of techniques for handling missing data

BACHELOR Assignment

Type: Bachelor CS 

If you are interested, please contact:

Introduction:

Missing data is often an obstacle, or at least a nuisance, in data science. "There are three main approaches to handle missing data:
* Imputation—where values are filled in the place of missing data,
* Omission—where samples with invalid data are discarded from further analysis, and
* Analysis—by directly applying methods unaffected by the missing values."
(Wikipedia: https://en.wikipedia.org/wiki/Missing_data)
Advances in probabilistic databases have the potential to provide even better imputation. Probabilistic data is data that is uncertain: for example, a value could be "25 with a probability of 70% or 26 with a probability of 30%" [1].
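
To make the notion of a probabilistic value concrete, the example above can be written down as a small mapping from possible values to probabilities. This is only an illustrative sketch: the variable `age` and the helper names are hypothetical, and a real probabilistic database would use its own representation.

```python
import math

# A probabilistic value: each possible value mapped to its probability.
# The numbers are the ones from the example in the text; the name `age`
# is a hypothetical attribute chosen for illustration.
age = {25: 0.7, 26: 0.3}

def is_distribution(pv, tol=1e-9):
    """Check that all probabilities are non-negative and sum to 1."""
    non_negative = all(p >= 0 for p in pv.values())
    return non_negative and math.isclose(sum(pv.values()), 1.0, abs_tol=tol)

def most_probable(pv):
    """Return the single most likely value (what a crisp imputation might keep)."""
    return max(pv, key=pv.get)

def expected_value(pv):
    """Return the probability-weighted average of the possible values."""
    return sum(value * p for value, p in pv.items())

print(is_distribution(age), most_probable(age), expected_value(age))
```

A crisp imputation method would store only one number (for instance the most probable value), whereas a probabilistic one keeps the whole mapping.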

Assignment:

The main objective of this project is to compare a novel technique based on probabilistic data with several other known techniques for imputation.
The idea is to take an existing good-quality data set without missing data, randomly delete varying amounts of the values of some attribute, impute these missing values with different methods, and compare the quality of the results. Because we start with a data set without missing values, the ground truth is available for these quality measurements. A probabilistic imputation approach is one where we do not fill in a single crisp value, but a set of possible values with their probabilities, i.e., a probability distribution. The quality of the probabilistic data can be measured with expected precision and recall [1].
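
The experimental loop described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the synthetic data, the function names, mean imputation as the crisp baseline, and the empirical distribution of observed values as a very simple probabilistic baseline (it is not the novel technique of the assignment). The score `mean_probability_of_truth` is likewise only a stand-in and is not the expected precision and recall of [1].

```python
import random
from collections import Counter
from statistics import mean

def delete_random(values, fraction, rng):
    """Return a copy with `fraction` of the entries set to None, plus the deleted indices."""
    k = round(fraction * len(values))
    missing = sorted(rng.sample(range(len(values)), k))
    damaged = list(values)
    for i in missing:
        damaged[i] = None
    return damaged, missing

def impute_mean(damaged):
    """Crisp baseline: replace every gap with the mean of the observed values."""
    fill = mean(v for v in damaged if v is not None)
    return [fill if v is None else v for v in damaged]

def impute_distribution(damaged):
    """Probabilistic baseline: replace every gap with the empirical
    distribution of the observed values; known cells get probability 1."""
    observed = [v for v in damaged if v is not None]
    dist = {v: c / len(observed) for v, c in Counter(observed).items()}
    return [dist if v is None else {v: 1.0} for v in damaged]

def rmse(truth, imputed, missing):
    """Root mean squared error of a crisp imputation on the deleted cells."""
    return (sum((truth[i] - imputed[i]) ** 2 for i in missing) / len(missing)) ** 0.5

def mean_probability_of_truth(truth, imputed, missing):
    """Average probability that a probabilistic imputation assigns to the true value."""
    return mean(imputed[i].get(truth[i], 0.0) for i in missing)

# Synthetic stand-in for "an existing good-quality data set" (one attribute).
rng = random.Random(0)
truth = [rng.choice([24, 25, 25, 25, 26, 27]) for _ in range(200)]

for fraction in (0.1, 0.3, 0.5):
    damaged, missing = delete_random(truth, fraction, rng)
    crisp = impute_mean(damaged)
    probabilistic = impute_distribution(damaged)
    print(fraction,
          rmse(truth, crisp, missing),
          mean_probability_of_truth(truth, probabilistic, missing))
```

Because the ground truth is kept alongside the damaged copy, each method can be scored only on the cells that were deleted, and the same deletion pattern can be reused across methods for a fair comparison.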

[1] van Keulen, M. (2018). Probabilistic Data Integration. In S. Sakr & A. Zomaya (Eds.), Encyclopedia of Big Data Technologies. Cham: Springer. https://doi.org/10.1007/978-3-319-63962-8_18-1
https://research.utwente.nl/en/publications/probabilistic-data-integration-3