ADA - Applied Data Analysis

Project duration:

July 2021 - June 2025

Project summary:

Day by day, a tremendous amount of data is produced, but its quality often needs improvement: typical problems include missing data, systematic errors, duplicate records, and noise [1, 2, 3]. Missing data are absent values in a dataset; in other words, they are values that were never observed but might provide useful information for data analysis [4]. Their presence may lead to wrong decisions [5]. The naive solution is to discard affected records, either listwise (keeping only complete records) or pairwise (using all values available for each analysis) [6], but this is not always reliable. The alternative is imputation, with methods ranging from traditional statistical approaches to machine learning [7]. Still, finding a suitable imputation method is costly and time-consuming, and often relies on brute-force trial-and-test strategies. The imputed values are then used as input to learning algorithms for classification and prediction tasks that extract meaningful insights from the data. Overall, there is a crucial need to speed up the selection of the imputation and classification/prediction methods best suited to a specific dataset with missing values. A faster selection process could improve many real-world applications, such as the Internet of Things (IoT) and smart cities, predictive analytics and intelligent decision-making, traffic prediction and transportation, cybersecurity and threat intelligence, healthcare (including the COVID-19 pandemic), e-commerce and product recommendations, and sustainable agriculture [8].
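To make the deletion and imputation strategies above concrete, here is a minimal sketch in pandas; the toy table and column names are illustrative assumptions, not project data:

```python
import numpy as np
import pandas as pd

# Toy dataset with missing values (NaN); columns are hypothetical.
df = pd.DataFrame({
    "age":    [25.0, np.nan, 47.0, 31.0],
    "income": [50.0, 62.0, np.nan, 58.0],
})

# Listwise deletion: keep only fully complete records.
listwise = df.dropna()

# Pairwise use: each column statistic uses all values available
# for that column (pandas skips NaN per column by default).
pairwise_means = df.mean()

# Simple mean imputation: fill each missing value with its column mean.
imputed = df.fillna(pairwise_means)
```

Listwise deletion shrinks the sample (here from four records to two), whereas imputation preserves every record at the cost of introducing estimated values, which is exactly the trade-off the summary describes.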


PhD Aim: To obtain a cost-efficient process for selecting the imputation and machine learning methods best suited to datasets containing missing data, by finding a relation between the data (and their characteristics), imputation methods, and learning tasks.
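The brute-force trial-and-test baseline that this selection process aims to improve on can be sketched in plain NumPy: hide some known values, impute them with each candidate method, and score the reconstruction error. The data, the masking rate, and the two candidate methods are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=2.0, size=(200, 3))  # complete toy data

# Hide 10% of entries to simulate missingness with known ground truth.
mask = rng.random(X.shape) < 0.10
X_missing = X.copy()
X_missing[mask] = np.nan

# Candidate imputation methods: fill each column with a statistic
# computed over its observed values.
candidates = {
    "mean":   np.nanmean,
    "median": np.nanmedian,
}

def impute(X_miss, stat):
    """Fill NaNs in each column with stat() of that column's observed values."""
    X_imp = X_miss.copy()
    for j in range(X_imp.shape[1]):
        col = X_imp[:, j]
        col[np.isnan(col)] = stat(col)
    return X_imp

# Brute-force trial-and-test: score each candidate by RMSE on the masked cells.
scores = {}
for name, stat in candidates.items():
    X_imp = impute(X_missing, stat)
    scores[name] = float(np.sqrt(np.mean((X_imp[mask] - X[mask]) ** 2)))

best = min(scores, key=scores.get)
```

Every candidate must be run on every dataset before one can be chosen, which is what makes the strategy costly; relating dataset characteristics to method performance, as the aim states, would let the selection skip most of these trials.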

References: 

[1] Lin Li, Taoxin Peng, and Jessie Kennedy. A rule-based taxonomy of dirty data. GSTF Journal on Computing (JoC), 1(2), 2014.

[2] Lydia Rabia, Idir Amine Amarouche, and Kadda Beghdad Bey. Rule-based approach for detecting dirty data in discharge summaries. In 2018 International Symposium on Programming and Systems (ISPS), pages 1–6. IEEE, 2018.

[3] Tabinda Sarwar, Sattar Seifollahi, Jeffrey Chan, Xiuzhen Zhang, Vural Aksakalli, Irene Hudson, Karin Verspoor, and Lawrence Cavedon. The secondary use of electronic health records for data mining: Data characteristics and challenges. ACM Computing Surveys (CSUR), 55(2):1–40, 2022.

[4] Roderick JA Little and Donald B Rubin. Statistical analysis with missing data, volume 793. John Wiley & Sons, 2019.

[5] Irene Petersen, Catherine A Welch, Irwin Nazareth, Kate Walters, Louise Marston, Richard W Morris, James R Carpenter, Tim P Morris, and Tra My Pham. Health indicator recording in UK primary care electronic health records: key implications for handling missing data. Clinical Epidemiology, 11:157, 2019.

[6] Judi Scheffer. Dealing with missing data. 2002.

[7] Sebastian Jäger, Arndt Allhorn, and Felix Biessmann. A benchmark for data imputation methods. Frontiers in Big Data, 4, 2021.

[8] Iqbal H Sarker. Machine learning: Algorithms, real-world applications and research directions. SN Computer Science, 2(3):1–21, 2021.

Project Supervisors:

Funding:

Turkish Government